├── 07_recurrent_neural_networks ├── recurrent_neural_networks.ipynb └── README.md ├── 08_transformers_and_attention_mechanisms ├── transformers_and_attention_mechanisms.ipynb └── README.md ├── 11_pytorch_lightning └── README.md ├── 09_generative_models └── README.md ├── 10_model_deployment └── README.md ├── 03_automatic_differentiation ├── automatic_differentiation.ipynb └── README.md ├── requirements.txt ├── LICENSE ├── 13_custom_extensions ├── README.md └── custom_extensions.py ├── 19_neural_architecture_search └── README.md ├── 17_model_optimization_techniques └── README.md ├── 14_performance_optimization ├── README.md └── performance_optimization.py ├── 15_advanced_model_architectures └── README.md ├── 16_reinforcement_learning └── README.md ├── 20_bayesian_deep_learning └── README.md ├── 18_meta_learning └── README.md ├── 21_advanced_research_topics └── README.md ├── 05_data_loading_preprocessing ├── data_loading_preprocessing.ipynb └── README.md ├── README.md ├── 01_pytorch_basics ├── pytorch_basics.py └── README.md ├── 12_distributed_training ├── README.md └── distributed_training.py ├── 06_convolutional_neural_networks └── README.md ├── 02_neural_networks_fundamentals └── README.md └── 04_training_neural_networks └── README.md /07_recurrent_neural_networks/recurrent_neural_networks.ipynb: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /08_transformers_and_attention_mechanisms/transformers_and_attention_mechanisms.ipynb: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /11_pytorch_lightning/README.md: -------------------------------------------------------------------------------- 1 | # PyTorch Lightning 2 | 3 | This section will cover PyTorch Lightning. 4 | 5 | ## Contents 6 | - Lightning modules 7 | - Trainers and callbacks 8 | - Multi-GPU training 9 | - Experiment logging -------------------------------------------------------------------------------- /09_generative_models/README.md: -------------------------------------------------------------------------------- 1 | # Generative Models 2 | 3 | This section will cover generative models in PyTorch. 4 | 5 | ## Contents 6 | - Autoencoders 7 | - Variational Autoencoders (VAEs) 8 | - Generative Adversarial Networks (GANs) 9 | - Diffusion models -------------------------------------------------------------------------------- /10_model_deployment/README.md: -------------------------------------------------------------------------------- 1 | # Model Deployment 2 | 3 | This section will cover model deployment in PyTorch. 4 | 5 | ## Contents 6 | - TorchScript and tracing 7 | - ONNX export 8 | - Quantization 9 | - Mobile deployment (PyTorch Mobile) 10 | - Web deployment (ONNX.js) -------------------------------------------------------------------------------- /03_automatic_differentiation/automatic_differentiation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Automatic Differentiation with PyTorch Autograd\n", 8 | "\n", 9 | "This notebook provides a detailed introduction to PyTorch's Autograd system, covering automatic differentiation concepts and practical implementation." 
10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "nbformat": 4, 19 | "nbformat_minor": 2 20 | } 21 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # Core PyTorch Dependencies 2 | torch>=2.0.0 3 | torchvision>=0.15.0 4 | torchaudio>=2.0.0 5 | 6 | # Data Science and Visualization 7 | numpy>=1.20.0 8 | matplotlib>=3.5.0 9 | seaborn>=0.12.0 10 | pandas>=1.3.0 11 | scikit-learn>=1.0.0 12 | 13 | # Jupyter Notebook Support 14 | jupyter>=1.0.0 15 | ipykernel>=6.0.0 16 | notebook>=6.4.0 17 | 18 | # PyTorch Ecosystem 19 | pytorch-lightning>=2.0.0 20 | torchmetrics>=0.10.0 21 | torchtext>=0.15.0 22 | 23 | # Deep Learning Libraries 24 | transformers>=4.20.0 25 | timm>=0.6.0 26 | 27 | # Computer Vision 28 | pillow>=9.0.0 29 | opencv-python>=4.6.0 30 | albumentations>=1.3.0 31 | 32 | # Natural Language Processing 33 | nltk>=3.7.0 34 | spacy>=3.4.0 35 | regex>=2022.3.15 36 | 37 | # Model Export and Deployment 38 | onnx>=1.12.0 39 | onnxruntime>=1.12.0 40 | 41 | # Training Utilities 42 | tqdm>=4.62.0 43 | tensorboard>=2.10.0 44 | wandb>=0.13.0 45 | 46 | # Additional Utilities 47 | requests>=2.25.0 48 | scipy>=1.7.0 -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Nicolai Høirup Nielsen 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /13_custom_extensions/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 13: Custom Extensions (C++ and CUDA) 2 | 3 | ## Overview 4 | This tutorial covers how to extend PyTorch with custom C++ and CUDA operations for performance-critical applications. You'll learn how to write, compile, and integrate custom extensions into your PyTorch workflows. 
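As a quick preview of the workflow, the snippet below is a minimal sketch of a JIT-compiled C++ extension built with `torch.utils.cpp_extension.load_inline`. The operator (`scaled_add`) and the extension name are illustrative placeholders rather than code from this tutorial, and a working C++ toolchain with the PyTorch headers is assumed to be available.

```python
import torch
from torch.utils.cpp_extension import load_inline

# C++ source for a toy operator: out = a + scale * b.
cpp_source = """
#include <torch/extension.h>

torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double scale) {
    return a + scale * b;
}
"""

# load_inline compiles the source on first use and exposes the listed
# functions as a Python module (the bindings are generated automatically).
ext = load_inline(
    name="scaled_add_ext",
    cpp_sources=cpp_source,
    functions=["scaled_add"],
    verbose=True,
)

a, b = torch.randn(3), torch.randn(3)
out = ext.scaled_add(a, b, 2.0)
print(out)
print(torch.allclose(out, a + 2.0 * b))  # sanity-check against pure PyTorch
```

Ahead-of-time builds with `setuptools`, CUDA kernels, and autograd integration follow the same basic pattern and are covered in the sections below.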
5 | 6 | ## Contents 7 | - Understanding when to use custom extensions 8 | - Writing C++ extensions 9 | - Creating CUDA kernels 10 | - Building and packaging extensions 11 | - JIT compilation vs ahead-of-time compilation 12 | - Debugging custom extensions 13 | 14 | ## Learning Objectives 15 | - Write custom C++ operations for PyTorch 16 | - Create CUDA kernels for GPU acceleration 17 | - Build and integrate extensions into PyTorch 18 | - Debug and optimize custom operations 19 | - Understand memory management in extensions 20 | 21 | ## Prerequisites 22 | - Strong understanding of PyTorch fundamentals 23 | - Basic C++ knowledge 24 | - CUDA programming basics (for GPU extensions) 25 | - Understanding of PyTorch's autograd system 26 | 27 | ## Key Concepts 28 | 1. **PyTorch Extension API**: Interface for creating custom operations 29 | 2. **Tensor Memory Layout**: Understanding contiguous memory and strides 30 | 3. **Autograd Integration**: Making custom ops work with automatic differentiation 31 | 4. **CUDA Kernels**: Writing GPU-accelerated operations 32 | 5. **Build Systems**: Using setuptools and JIT compilation 33 | 34 | ## Practical Applications 35 | - Performance-critical operations 36 | - Novel layer implementations 37 | - Custom optimizers 38 | - Specialized data structures 39 | - Hardware-specific optimizations 40 | 41 | ## Next Steps 42 | After completing this tutorial, you'll be able to create high-performance custom operations that seamlessly integrate with PyTorch's ecosystem. -------------------------------------------------------------------------------- /19_neural_architecture_search/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 19: Neural Architecture Search 2 | 3 | ## Overview 4 | This tutorial explores Neural Architecture Search (NAS) techniques for automatically designing optimal neural network architectures. You'll learn about different search strategies, search spaces, and performance estimation methods, with practical implementations in PyTorch. 5 | 6 | ## Contents 7 | - Introduction to NAS concepts and motivation 8 | - Search space design 9 | - Random search and grid search 10 | - Evolutionary algorithms for NAS 11 | - Differentiable architecture search (DARTS) 12 | - Efficient NAS techniques 13 | - Performance estimation strategies 14 | 15 | ## Learning Objectives 16 | - Understand the NAS problem formulation 17 | - Design effective search spaces 18 | - Implement various search strategies 19 | - Build differentiable architecture search 20 | - Apply early stopping and performance prediction 21 | - Evaluate and compare architectures 22 | 23 | ## Prerequisites 24 | - Strong PyTorch and deep learning knowledge 25 | - Understanding of various network architectures 26 | - Basic knowledge of optimization algorithms 27 | - Familiarity with computational graphs 28 | 29 | ## Key Concepts 30 | 1. **Search Space**: Set of possible architectures 31 | 2. **Search Strategy**: Algorithm to explore the space 32 | 3. **Performance Estimation**: Evaluating architectures efficiently 33 | 4. **Supernet**: Weight-sharing approaches 34 | 5. 
**Architecture Encoding**: Representing architectures 35 | 36 | ## Practical Applications 37 | - AutoML systems 38 | - Hardware-specific optimization 39 | - Domain-specific architecture design 40 | - Model compression 41 | - Multi-objective optimization 42 | - Efficient model discovery 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to implement NAS techniques to automatically discover optimal architectures for your specific tasks and constraints. -------------------------------------------------------------------------------- /17_model_optimization_techniques/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 17: Model Optimization Techniques 2 | 3 | ## Overview 4 | This tutorial covers advanced model optimization techniques including quantization, pruning, knowledge distillation, and neural architecture search. You'll learn how to make models smaller, faster, and more efficient for deployment while maintaining accuracy. 5 | 6 | ## Contents 7 | - Model quantization (INT8, dynamic, static) 8 | - Network pruning (structured and unstructured) 9 | - Knowledge distillation 10 | - Model compression techniques 11 | - Efficient inference optimization 12 | - Hardware-aware optimization 13 | - Deployment considerations 14 | 15 | ## Learning Objectives 16 | - Implement various quantization schemes 17 | - Apply pruning to reduce model size 18 | - Use knowledge distillation for model compression 19 | - Optimize models for specific hardware 20 | - Balance accuracy vs efficiency trade-offs 21 | - Deploy optimized models effectively 22 | 23 | ## Prerequisites 24 | - Strong PyTorch fundamentals 25 | - Understanding of neural network architectures 26 | - Basic knowledge of computer architecture 27 | - Familiarity with model training 28 | 29 | ## Key Concepts 30 | 1. **Quantization**: Reducing numerical precision 31 | 2. **Pruning**: Removing unnecessary parameters 32 | 3. **Distillation**: Transferring knowledge to smaller models 33 | 4. **Compression**: Reducing model size and complexity 34 | 5. **Hardware Optimization**: Tailoring models for specific devices 35 | 36 | ## Practical Applications 37 | - Mobile and edge deployment 38 | - Real-time inference systems 39 | - Resource-constrained environments 40 | - Cloud cost optimization 41 | - Embedded AI systems 42 | - IoT applications 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to optimize PyTorch models for production deployment, significantly reducing their computational requirements while maintaining performance. -------------------------------------------------------------------------------- /14_performance_optimization/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 14: Performance Optimization 2 | 3 | ## Overview 4 | This tutorial covers comprehensive performance optimization techniques for PyTorch models, from basic profiling to advanced optimization strategies. You'll learn how to identify bottlenecks and apply various optimization techniques to improve training and inference speed. 
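As a quick preview, the snippet below is a minimal sketch of the profiling workflow this tutorial starts from, using `torch.profiler` on a small stand-in model; the layer sizes, batch size, and iteration count are arbitrary placeholders rather than code from this tutorial.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# A small stand-in model and batch; swap in your own workload.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
x = torch.randn(64, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

# Record a few forward passes and see which operators dominate runtime.
with profile(activities=activities, record_shapes=True) as prof:
    with record_function("forward_pass"):
        for _ in range(10):
            model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The sections below build on this kind of trace to target memory usage, data loading, mixed precision, and parallelism.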
5 | 6 | ## Contents 7 | - Profiling PyTorch models 8 | - Memory optimization techniques 9 | - Mixed precision training 10 | - Data loading optimization 11 | - Model parallelism and distributed training 12 | - Kernel fusion and graph optimization 13 | - Hardware-specific optimizations 14 | 15 | ## Learning Objectives 16 | - Profile and identify performance bottlenecks 17 | - Optimize memory usage and reduce memory fragmentation 18 | - Implement mixed precision training effectively 19 | - Speed up data loading pipelines 20 | - Apply model and data parallelism 21 | - Use TorchScript for production optimization 22 | 23 | ## Prerequisites 24 | - Strong understanding of PyTorch fundamentals 25 | - Experience training neural networks 26 | - Basic understanding of GPU architecture 27 | - Familiarity with Python profiling tools 28 | 29 | ## Key Concepts 30 | 1. **Profiling**: Using PyTorch profiler to identify bottlenecks 31 | 2. **Memory Management**: Efficient tensor allocation and deallocation 32 | 3. **Mixed Precision**: Using FP16/BF16 for faster computation 33 | 4. **Data Pipeline**: Optimizing data loading and preprocessing 34 | 5. **Parallelism**: Distributing computation across devices 35 | 36 | ## Practical Applications 37 | - Large-scale model training 38 | - Real-time inference systems 39 | - Mobile and edge deployment 40 | - Cloud-based ML services 41 | - Research experiments at scale 42 | 43 | ## Next Steps 44 | After completing this tutorial, you'll be equipped to optimize PyTorch models for various deployment scenarios and achieve significant performance improvements. -------------------------------------------------------------------------------- /15_advanced_model_architectures/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 15: Advanced Model Architectures 2 | 3 | ## Overview 4 | This tutorial explores cutting-edge neural network architectures including Graph Neural Networks (GNNs), Vision Transformers (ViT), Neural Architecture Search (NAS) results, and other state-of-the-art models. You'll learn to implement and train these advanced architectures for various tasks. 5 | 6 | ## Contents 7 | - Graph Neural Networks (GCN, GAT, GraphSAGE) 8 | - Vision Transformers and variants 9 | - EfficientNet and compound scaling 10 | - Neural ODEs 11 | - Capsule Networks 12 | - Self-supervised learning architectures 13 | - Multimodal architectures 14 | 15 | ## Learning Objectives 16 | - Implement Graph Neural Networks for graph-structured data 17 | - Build Vision Transformers from scratch 18 | - Understand and apply efficient model scaling 19 | - Work with continuous-time neural networks 20 | - Implement advanced attention mechanisms 21 | - Design architectures for multimodal learning 22 | 23 | ## Prerequisites 24 | - Strong understanding of CNNs and Transformers 25 | - Familiarity with graph theory basics 26 | - Experience with PyTorch modules and autograd 27 | - Understanding of attention mechanisms 28 | 29 | ## Key Concepts 30 | 1. **Graph Convolutions**: Learning on graph-structured data 31 | 2. **Patch Embeddings**: Converting images to sequences for transformers 32 | 3. **Compound Scaling**: Efficiently scaling model dimensions 33 | 4. **Neural ODEs**: Continuous-depth models 34 | 5. 
**Routing Mechanisms**: Dynamic computation paths 35 | 36 | ## Practical Applications 37 | - Social network analysis 38 | - Molecular property prediction 39 | - Large-scale image classification 40 | - Video understanding 41 | - Multimodal AI systems 42 | - Scientific computing 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be equipped to implement and adapt state-of-the-art architectures for your specific use cases and research. -------------------------------------------------------------------------------- /16_reinforcement_learning/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 16: Reinforcement Learning 2 | 3 | ## Overview 4 | This tutorial introduces reinforcement learning (RL) with PyTorch, covering fundamental algorithms and modern deep RL techniques. You'll learn to implement agents that learn through interaction with environments, from basic Q-learning to advanced policy gradient methods. 5 | 6 | ## Contents 7 | - RL fundamentals and terminology 8 | - Deep Q-Networks (DQN) and variants 9 | - Policy gradient methods (REINFORCE, A2C, PPO) 10 | - Actor-Critic architectures 11 | - Multi-agent reinforcement learning 12 | - Model-based RL approaches 13 | - Practical training tips and tricks 14 | 15 | ## Learning Objectives 16 | - Understand core RL concepts (MDPs, value functions, policies) 17 | - Implement DQN with experience replay and target networks 18 | - Build policy gradient algorithms from scratch 19 | - Create efficient Actor-Critic agents 20 | - Handle continuous action spaces 21 | - Debug and visualize RL training 22 | 23 | ## Prerequisites 24 | - Strong PyTorch fundamentals 25 | - Basic understanding of probability and statistics 26 | - Familiarity with neural network training 27 | - Knowledge of optimization techniques 28 | 29 | ## Key Concepts 30 | 1. **Markov Decision Processes**: Mathematical framework for RL 31 | 2. **Value Functions**: Estimating future rewards 32 | 3. **Policy Optimization**: Learning optimal behavior directly 33 | 4. **Exploration vs Exploitation**: Balancing learning and performance 34 | 5. **Experience Replay**: Efficient use of past experiences 35 | 36 | ## Practical Applications 37 | - Game playing AI 38 | - Robotics control 39 | - Resource management 40 | - Trading strategies 41 | - Autonomous navigation 42 | - Recommendation systems 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to implement and train RL agents for various tasks, understand the trade-offs between different algorithms, and apply RL to real-world problems. -------------------------------------------------------------------------------- /20_bayesian_deep_learning/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 20: Bayesian Deep Learning 2 | 3 | ## Overview 4 | This tutorial explores Bayesian approaches to deep learning, focusing on uncertainty quantification and probabilistic modeling. You'll learn how to build neural networks that can express uncertainty in their predictions, implement various Bayesian neural network techniques, and understand when and why to use them. 
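As a quick preview, the snippet below is a minimal sketch of Monte Carlo Dropout, the first uncertainty technique listed in the contents; the network, layer sizes, and number of stochastic samples are illustrative assumptions rather than code from this tutorial.

```python
import torch
import torch.nn as nn

# A small classifier with dropout; the sizes here are placeholders.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 3),
)

def mc_dropout_predict(model, x, num_samples=50):
    """Average several stochastic forward passes with dropout left active."""
    model.train()  # keep dropout enabled at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(num_samples)]
        )
    # The mean is the prediction; the spread across samples is an uncertainty proxy.
    return probs.mean(dim=0), probs.std(dim=0)

x = torch.randn(5, 20)
mean, std = mc_dropout_predict(model, x)
print("Predictive mean:\n", mean)
print("Per-class std (uncertainty proxy):\n", std)
```

Variational Bayesian layers and deep ensembles, covered later in this tutorial, provide alternative ways to estimate the same quantities at higher computational cost.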
5 | 6 | ## Contents 7 | - Introduction to Bayesian deep learning 8 | - Monte Carlo Dropout for uncertainty estimation 9 | - Bayesian Neural Networks with variational inference 10 | - Deep ensembles 11 | - Gaussian processes and neural networks 12 | - Uncertainty calibration 13 | - Applications and best practices 14 | 15 | ## Learning Objectives 16 | - Understand uncertainty in neural networks 17 | - Implement Monte Carlo Dropout 18 | - Build Bayesian neural networks with PyTorch 19 | - Create and train deep ensembles 20 | - Quantify and calibrate uncertainty 21 | - Apply Bayesian methods to real problems 22 | 23 | ## Prerequisites 24 | - Strong understanding of deep learning 25 | - Basic probability and statistics 26 | - Familiarity with Bayesian inference concepts 27 | - PyTorch fundamentals 28 | 29 | ## Key Concepts 30 | 1. **Epistemic Uncertainty**: Model uncertainty 31 | 2. **Aleatoric Uncertainty**: Data uncertainty 32 | 3. **Variational Inference**: Approximate Bayesian inference 33 | 4. **Posterior Distribution**: Distribution over model parameters 34 | 5. **Predictive Uncertainty**: Uncertainty in predictions 35 | 36 | ## Practical Applications 37 | - Medical diagnosis with confidence estimates 38 | - Autonomous driving safety 39 | - Financial risk assessment 40 | - Active learning 41 | - Out-of-distribution detection 42 | - Robust decision making 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to build neural networks that know what they don't know, crucial for safety-critical applications and informed decision-making. -------------------------------------------------------------------------------- /18_meta_learning/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 18: Meta-Learning and Few-Shot Learning 2 | 3 | ## Overview 4 | This tutorial explores meta-learning (learning to learn) and few-shot learning techniques in PyTorch. You'll learn how to build models that can quickly adapt to new tasks with minimal training data, including implementations of MAML, Prototypical Networks, and other state-of-the-art approaches. 5 | 6 | ## Contents 7 | - Introduction to meta-learning concepts 8 | - Model-Agnostic Meta-Learning (MAML) 9 | - Prototypical Networks 10 | - Matching Networks 11 | - Reptile algorithm 12 | - Few-shot classification and regression 13 | - Applications and best practices 14 | 15 | ## Learning Objectives 16 | - Understand the meta-learning paradigm 17 | - Implement MAML for fast adaptation 18 | - Build Prototypical Networks for few-shot classification 19 | - Create Matching Networks with attention 20 | - Apply meta-learning to real problems 21 | - Evaluate few-shot learning performance 22 | 23 | ## Prerequisites 24 | - Strong PyTorch and deep learning knowledge 25 | - Understanding of gradient-based optimization 26 | - Familiarity with classification tasks 27 | - Basic knowledge of attention mechanisms 28 | 29 | ## Key Concepts 30 | 1. **Meta-Learning**: Learning algorithms that improve with experience 31 | 2. **Few-Shot Learning**: Learning from very few examples 32 | 3. **Task Distribution**: Learning over distributions of tasks 33 | 4. **Fast Adaptation**: Quick learning on new tasks 34 | 5. 
**Episodic Training**: Training on task episodes 35 | 36 | ## Practical Applications 37 | - Medical diagnosis with limited data 38 | - Personalized recommendation systems 39 | - Robotics and control 40 | - Drug discovery 41 | - Rare event detection 42 | - Language understanding for low-resource languages 43 | 44 | ## Next Steps 45 | After this tutorial, you'll be able to implement meta-learning algorithms for scenarios with limited data and build systems that can quickly adapt to new tasks. -------------------------------------------------------------------------------- /21_advanced_research_topics/README.md: -------------------------------------------------------------------------------- 1 | # Tutorial 21: Advanced Research Topics 2 | 3 | ## Overview 4 | This tutorial explores cutting-edge research topics in deep learning, including neural ODEs, implicit neural representations, self-supervised learning, contrastive learning, and other emerging areas. You'll learn about the latest advances in the field and how to implement them using PyTorch. 5 | 6 | ## Contents 7 | - Neural Ordinary Differential Equations (Neural ODEs) 8 | - Implicit Neural Representations (NeRF, SIREN) 9 | - Self-supervised learning methods 10 | - Contrastive learning (SimCLR, MoCo) 11 | - Diffusion models basics 12 | - Transformer variants and improvements 13 | - Emerging architectures and techniques 14 | 15 | ## Learning Objectives 16 | - Understand neural ODEs and continuous-depth models 17 | - Implement implicit neural representations 18 | - Master self-supervised learning techniques 19 | - Build contrastive learning systems 20 | - Explore diffusion models 21 | - Understand recent transformer innovations 22 | - Apply cutting-edge techniques to real problems 23 | 24 | ## Prerequisites 25 | - Strong foundation in deep learning 26 | - Experience with PyTorch 27 | - Understanding of advanced architectures 28 | - Familiarity with research papers 29 | - Mathematical maturity 30 | 31 | ## Key Concepts 32 | 1. **Neural ODEs**: Continuous-depth neural networks 33 | 2. **Implicit Representations**: Coordinate-based neural networks 34 | 3. **Self-Supervised Learning**: Learning without labels 35 | 4. **Contrastive Learning**: Learning representations through contrasts 36 | 5. **Diffusion Models**: Generative models via denoising 37 | 6. **Attention Mechanisms**: Advanced transformer techniques 38 | 39 | ## Practical Applications 40 | - 3D scene reconstruction 41 | - Representation learning 42 | - Few-shot learning 43 | - Image generation 44 | - Video understanding 45 | - Scientific computing 46 | - Robotics and control 47 | 48 | ## Next Steps 49 | After this tutorial, you'll be equipped to: 50 | - Read and implement research papers 51 | - Contribute to open-source projects 52 | - Conduct your own research 53 | - Apply state-of-the-art methods 54 | - Push the boundaries of deep learning -------------------------------------------------------------------------------- /05_data_loading_preprocessing/data_loading_preprocessing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Loading, Preprocessing, and Augmentation in PyTorch\n", 8 | "\n", 9 | "This notebook provides a comprehensive guide to efficiently loading, preprocessing, and augmenting data in PyTorch. Effective data handling is critical for any machine learning pipeline." 
10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import torch\n", 19 | "import torch.nn as nn\n", 20 | "import torch.optim as optim\n", 21 | "import torchvision\n", 22 | "import torchvision.transforms as transforms\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import numpy as np\n", 25 | "from torch.utils.data import Dataset, DataLoader, random_split\n", 26 | "from PIL import Image\n", 27 | "import os\n", 28 | "import pandas as pd\n", 29 | "from pathlib import Path\n", 30 | "import glob\n", 31 | "import time\n", 32 | "\n", 33 | "# Set random seed for reproducibility\n", 34 | "torch.manual_seed(42)\n", 35 | "np.random.seed(42)\n", 36 | "\n", 37 | "# Device configuration\n", 38 | "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", 39 | "print(f\"Using device: {device}\")\n", 40 | "print(f\"PyTorch version: {torch.__version__}\")\n", 41 | "\n", 42 | "# Create output directory\n", 43 | "output_dir = \"05_data_loading_preprocessing_outputs\"\n", 44 | "os.makedirs(output_dir, exist_ok=True)" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "## 1. Introduction to Data Handling\n", 52 | "\n", 53 | "Data loading and preprocessing are critical steps in any machine learning pipeline:\n", 54 | "\n", 55 | "- **Loading:** Reading data from various sources (files, databases)\n", 56 | "- **Preprocessing:** Cleaning, transforming, and structuring data\n", 57 | "- **Augmentation:** Artificially expanding the dataset for better generalization" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [ 66 | "# Demonstrate built-in datasets\n", 67 | "print(\"Using Built-in Datasets:\")\n", 68 | "print(\"-\" * 30)\n", 69 | "\n", 70 | "# Load MNIST dataset\n", 71 | "mnist_dataset = torchvision.datasets.MNIST(\n", 72 | " root='./data', \n", 73 | " train=True, \n", 74 | " download=True, \n", 75 | " transform=transforms.ToTensor()\n", 76 | ")\n", 77 | "\n", 78 | "print(f\"Dataset size: {len(mnist_dataset)}\")\n", 79 | "sample, label = mnist_dataset[0]\n", 80 | "print(f\"Sample shape: {sample.shape}\")\n", 81 | "print(f\"Sample dtype: {sample.dtype}\")\n", 82 | "print(f\"Label: {label}\")\n", 83 | "\n", 84 | "# Visualize a sample\n", 85 | "plt.figure(figsize=(8, 4))\n", 86 | "plt.subplot(1, 2, 1)\n", 87 | "plt.imshow(sample.squeeze(), cmap='gray')\n", 88 | "plt.title(f'MNIST Sample (Label: {label})')\n", 89 | "plt.axis('off')\n", 90 | "\n", 91 | "# Show multiple samples\n", 92 | "plt.subplot(1, 2, 2)\n", 93 | "fig, axes = plt.subplots(2, 3, figsize=(6, 4))\n", 94 | "for i, ax in enumerate(axes.flat):\n", 95 | " if i < 6:\n", 96 | " img, lbl = mnist_dataset[i]\n", 97 | " ax.imshow(img.squeeze(), cmap='gray')\n", 98 | " ax.set_title(f'Label: {lbl}')\n", 99 | " ax.axis('off')\n", 100 | "plt.tight_layout()\n", 101 | "plt.show()" 102 | ] 103 | } 104 | ], 105 | "metadata": { 106 | "language_info": { 107 | "name": "python" 108 | } 109 | }, 110 | "nbformat": 4, 111 | "nbformat_minor": 2 112 | } 113 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # PyTorch Tutorials 2 | 3 | A comprehensive collection of PyTorch tutorials from beginner to expert level. 
This repository aims to provide practical, hands-on examples and explanations for various PyTorch concepts and applications. 4 | 5 | ## 🚀 Quick Start 6 | 7 | ### Installation 8 | ```bash 9 | git clone https://github.com/niconielsen32/pytorch-tutorials.git 10 | cd pytorch-tutorials 11 | pip install -r requirements.txt 12 | ``` 13 | 14 | ### Running the Tutorials 15 | ```bash 16 | # Run Python scripts directly 17 | python 01_pytorch_basics/pytorch_basics.py 18 | 19 | # Or use Jupyter notebooks for interactive learning 20 | jupyter notebook 21 | # Then navigate to any tutorial folder and open the .ipynb file 22 | ``` 23 | 24 | ## 📚 Table of Contents 25 | 26 | ### **Fundamentals** 27 | 28 | #### Beginner Level 29 | 30 | 1. **[PyTorch Basics](01_pytorch_basics/)** 31 | - Tensors, operations, and computational graphs 32 | - NumPy integration 33 | - GPU acceleration 34 | - Basic autograd operations 35 | 36 | 2. **[Neural Networks Fundamentals](02_neural_networks_fundamentals/)** 37 | - Linear layers, activation functions, loss functions, optimizers 38 | - Building your first neural network 39 | - Forward and backward propagation 40 | - nn.Module and nn.Sequential 41 | 42 | 3. **[Automatic Differentiation](03_automatic_differentiation/)** 43 | - Autograd mechanics 44 | - Computing gradients 45 | - Custom autograd functions 46 | - Higher-order derivatives 47 | 48 | 4. **[Training Neural Networks](04_training_neural_networks/)** 49 | - Training loop implementation 50 | - Validation techniques 51 | - Hyperparameter tuning 52 | - Learning rate scheduling 53 | - Early stopping 54 | 55 | 5. **[Data Loading and Preprocessing](05_data_loading_preprocessing/)** 56 | - Dataset and DataLoader classes 57 | - Custom datasets 58 | - Data transformations and augmentation 59 | - Efficient data loading techniques 60 | - Batch processing 61 | 62 | ### **Computer Vision** 63 | 64 | #### Intermediate Level 65 | 66 | 6. **[Convolutional Neural Networks](06_convolutional_neural_networks/)** 67 | - CNN architecture components 68 | - Convolution, pooling, and fully connected layers 69 | - Image classification with CNNs 70 | - Transfer learning with pre-trained models 71 | - Feature visualization 72 | 73 | #### Advanced Computer Vision Applications 74 | - Object detection (YOLO, R-CNN) 75 | - Semantic segmentation 76 | - Instance segmentation 77 | - Image generation 78 | - Style transfer 79 | 80 | ### **Natural Language Processing** 81 | 82 | 7. **[Recurrent Neural Networks](07_recurrent_neural_networks/)** 83 | - RNN architecture 84 | - LSTM and GRU implementations 85 | - Sequence modeling 86 | - Text classification 87 | - Text generation 88 | - Time series forecasting 89 | 90 | 8. **[Transformers and Attention Mechanisms](08_transformers_and_attention_mechanisms/)** 91 | - Self-attention and multi-head attention 92 | - Transformer architecture 93 | - BERT and GPT model implementations 94 | - Fine-tuning pre-trained transformers 95 | - Positional encoding 96 | 97 | ### **Advanced Topics** 98 | 99 | #### Advanced Level 100 | 101 | 9. **[Generative Models](09_generative_models/)** 102 | - Autoencoders 103 | - Variational Autoencoders (VAEs) 104 | - Generative Adversarial Networks (GANs) 105 | - Diffusion models 106 | - Style transfer 107 | 108 | 10. **[Model Deployment](10_model_deployment/)** 109 | - TorchScript and tracing 110 | - ONNX export 111 | - Quantization techniques 112 | - Mobile deployment (PyTorch Mobile) 113 | - Web deployment (ONNX.js) 114 | - Model serving 115 | 116 | 11. 
**[PyTorch Lightning](11_pytorch_lightning/)** 117 | - Lightning modules 118 | - Trainers and callbacks 119 | - Multi-GPU training 120 | - Experiment logging 121 | - Hyperparameter tuning with Lightning 122 | 123 | 12. **[Distributed Training](12_distributed_training/)** 124 | - Data Parallel (DP) for single-machine multi-GPU 125 | - Distributed Data Parallel (DDP) for multi-node training 126 | - Model Parallel for large models 127 | - Pipeline Parallelism for deep networks 128 | - Fully Sharded Data Parallel (FSDP) for extreme scale 129 | 130 | ### **Additional Advanced Topics** 131 | 132 | 13. **[Custom Extensions](13_custom_extensions/)** 133 | - C++ extensions for custom operations 134 | - CUDA kernels for GPU acceleration 135 | - Custom autograd functions 136 | - JIT compilation with TorchScript 137 | - Binding C++/CUDA code to Python 138 | 139 | 14. **[Performance Optimization](14_performance_optimization/)** 140 | - Memory optimization techniques 141 | - Mixed precision training with AMP 142 | - Profiling and benchmarking 143 | - Data loading optimization 144 | - Gradient accumulation and checkpointing 145 | 146 | 15. **[Advanced Model Architectures](15_advanced_model_architectures/)** 147 | - Graph Neural Networks (GNNs) 148 | - Vision Transformers (ViT) 149 | - EfficientNet and compound scaling 150 | - Neural ODEs 151 | - Capsule Networks 152 | 153 | 16. **[Reinforcement Learning](16_reinforcement_learning/)** 154 | - Deep Q-Networks (DQN) 155 | - Policy gradient methods (REINFORCE) 156 | - Actor-Critic and A2C 157 | - Proximal Policy Optimization (PPO) 158 | - Integration with OpenAI Gym 159 | 160 | 17. **[Model Optimization Techniques](17_model_optimization_techniques/)** 161 | - Quantization (dynamic and static) 162 | - Pruning (structured and unstructured) 163 | - Knowledge distillation 164 | - Model compression 165 | - Hardware-aware optimization 166 | 167 | 18. **[Meta-Learning and Few-Shot Learning](18_meta_learning/)** 168 | - Model-Agnostic Meta-Learning (MAML) 169 | - Prototypical Networks 170 | - Matching Networks 171 | - Reptile algorithm 172 | - Few-shot classification tasks 173 | 174 | ### **Expert Level Topics** 175 | 176 | 19. **[Neural Architecture Search](19_neural_architecture_search/)** 177 | - Random search and grid search 178 | - Evolutionary algorithms 179 | - Differentiable Architecture Search (DARTS) 180 | - Efficient Neural Architecture Search (ENAS) 181 | - Performance prediction 182 | 183 | 20. **[Bayesian Deep Learning](20_bayesian_deep_learning/)** 184 | - Bayesian Neural Networks 185 | - Variational inference 186 | - Monte Carlo Dropout 187 | - Deep ensembles 188 | - Uncertainty quantification 189 | 190 | 21. 
**[Advanced Research Topics](21_advanced_research_topics/)** 191 | - Self-supervised learning (SimCLR, BYOL) 192 | - Contrastive learning methods 193 | - Diffusion models 194 | - Neural Radiance Fields (NeRF) 195 | - Implicit neural representations 196 | 197 | ## 📋 Each Tutorial Includes 198 | 199 | - **📖 README.md** - Detailed theory and concepts 200 | - **🐍 Python Script** - Complete runnable code with comments 201 | - **📓 Jupyter Notebook** - Interactive step-by-step learning 202 | 203 | ## 🛠️ Requirements 204 | 205 | - Python 3.8+ 206 | - PyTorch 2.0+ 207 | - torchvision 208 | - torchaudio (for audio tutorials) 209 | - matplotlib 210 | - numpy 211 | - pandas 212 | - scikit-learn 213 | - Jupyter Notebook/Lab 214 | 215 | You can install the required packages using: 216 | ```bash 217 | pip install -r requirements.txt 218 | ``` 219 | 220 | ## 📖 How to Use This Repository 221 | 222 | 1. **Sequential Learning**: Follow the tutorials in order for a comprehensive learning experience 223 | 2. **Topic-Based**: Jump to specific topics based on your interests and needs 224 | 3. **Practice**: Each tutorial contains exercises and examples 225 | 4. **Experiment**: Modify the code and experiment with different parameters 226 | 227 | ### Getting Started 228 | 229 | 1. **Start with the README** in each folder for theoretical background 230 | 2. **Run the Python script** to see the complete implementation 231 | 3. **Open the Jupyter notebook** for interactive learning and experimentation 232 | 233 | ## 🤝 Contributing 234 | 235 | Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change. 236 | 237 | ## 📄 License 238 | 239 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 240 | 241 | ## 🙏 Acknowledgments 242 | 243 | - PyTorch team for the amazing framework 244 | - The deep learning community for continuous innovation 245 | - All contributors to this repository 246 | 247 | --- 248 | 249 | Perfect for both beginners starting their PyTorch journey and experts looking to deepen their understanding of advanced topics! -------------------------------------------------------------------------------- /08_transformers_and_attention_mechanisms/README.md: -------------------------------------------------------------------------------- 1 | # Transformers and Attention Mechanisms 2 | 3 | This tutorial delves into Transformers and Attention Mechanisms, pivotal concepts in modern deep learning, especially for natural language processing and beyond. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Attention Mechanisms](#introduction-to-attention-mechanisms) 7 | - What is Attention? 8 | - Types of Attention (Bahdanau, Luong) 9 | 2. [Self-Attention](#self-attention) 10 | - Concept and Motivation 11 | - Scaled Dot-Product Attention 12 | 3. [Multi-Head Attention](#multi-head-attention) 13 | - Purpose and Architecture 14 | - Implementation Details 15 | 4. [The Transformer Architecture](#the-transformer-architecture) 16 | - Encoder-Decoder Structure 17 | - Positional Encoding 18 | - Feed-Forward Networks 19 | - Layer Normalization and Residual Connections 20 | 5. [Building a Transformer Block](#building-a-transformer-block) 21 | - Encoder Block 22 | - Decoder Block 23 | 6. [Applications of Transformers](#applications-of-transformers) 24 | - Natural Language Processing (e.g., Translation, Summarization) 25 | - Vision Transformers (ViT) 26 | 7. 
[Implementing a Simple Transformer with PyTorch](#implementing-a-simple-transformer-with-pytorch) 27 | - Step-by-step guide 28 | 8. [Pre-trained Transformer Models (BERT, GPT)](#pre-trained-transformer-models-bert-gpt) 29 | - Overview of popular models 30 | - Using Hugging Face Transformers library 31 | 9. [Fine-tuning Pre-trained Transformers](#fine-tuning-pre-trained-transformers) 32 | - Concepts and techniques 33 | - Example: Text classification 34 | 35 | ## Introduction to Attention Mechanisms 36 | 37 | Attention mechanisms in deep learning are inspired by human visual attention – the ability to focus on specific parts of an image while perceiving the whole. In the context of neural networks, attention allows a model to dynamically focus on different parts of the input sequence when producing an output. 38 | 39 | - **What is Attention?** 40 | - A mechanism that allows the model to assign different weights (importance scores) to different parts of the input. 41 | - Helps in handling long sequences and capturing long-range dependencies. 42 | - **Types of Attention:** 43 | - **Bahdanau Attention (Additive Attention):** Uses a feed-forward network to compute alignment scores. 44 | - **Luong Attention (Multiplicative Attention):** Uses dot-product based alignment scores. 45 | 46 | ## Self-Attention 47 | 48 | Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It is a key component of Transformers. 49 | 50 | - **Concept and Motivation:** 51 | - Allows the model to weigh the importance of other words in the *same* sentence when encoding a particular word. 52 | - Example: "The animal didn't cross the street because **it** was too tired." Self-attention helps determine if "it" refers to "animal" or "street". 53 | - **Scaled Dot-Product Attention:** 54 | - The core of self-attention. 55 | - Queries (Q), Keys (K), and Values (V) are computed from the input embeddings. 56 | - Attention(Q, K, V) = `softmax((Q * K^T) / sqrt(d_k)) * V` 57 | - `d_k` is the dimension of the key vectors; scaling by `sqrt(d_k)` keeps large dot products from saturating the softmax, which would otherwise produce vanishingly small gradients. 58 | 59 | ## Multi-Head Attention 60 | 61 | Instead of performing a single attention function, Multi-Head Attention runs multiple attention mechanisms in parallel and concatenates their outputs. 62 | 63 | - **Purpose and Architecture:** 64 | - Allows the model to jointly attend to information from different representation subspaces at different positions. 65 | - Each "head" can learn different aspects of the input. 66 | - Input Q, K, V are linearly projected `h` times with different, learned linear projections. 67 | - Attention is applied by each head in parallel. 68 | - Outputs are concatenated and linearly projected again. 69 | 70 | ## The Transformer Architecture 71 | 72 | The Transformer model, introduced in "Attention Is All You Need," relies entirely on attention mechanisms, dispensing with recurrence and convolutions. 73 | 74 | - **Encoder-Decoder Structure:** 75 | - **Encoder:** Maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations (z1, ..., zn). Composed of a stack of N identical layers. 76 | - **Decoder:** Given z, generates an output sequence (y1, ..., ym) one symbol at a time. Also composed of a stack of N identical layers. The decoder incorporates an additional multi-head attention over the output of the encoder stack.
77 | - **Positional Encoding:** 78 | - Since Transformers contain no recurrence or convolution, positional encodings are added to the input embeddings to give the model information about the relative or absolute position of tokens in the sequence. 79 | - Sine and cosine functions of different frequencies are typically used. 80 | - **Feed-Forward Networks:** 81 | - Each layer in the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. 82 | - **Layer Normalization and Residual Connections:** 83 | - Each sub-layer (self-attention, feed-forward network) in the encoder and decoder has a residual connection around it, followed by layer normalization. 84 | 85 | ## Building a Transformer Block 86 | 87 | - **Encoder Block:** 88 | - Multi-Head Self-Attention layer 89 | - Add & Norm (Residual Connection + Layer Normalization) 90 | - Position-wise Feed-Forward Network 91 | - Add & Norm 92 | - **Decoder Block:** 93 | - Masked Multi-Head Self-Attention layer (to prevent attending to future positions) 94 | - Add & Norm 95 | - Multi-Head Attention (over encoder output) 96 | - Add & Norm 97 | - Position-wise Feed-Forward Network 98 | - Add & Norm 99 | 100 | ## Applications of Transformers 101 | 102 | - **Natural Language Processing (NLP):** 103 | - Machine Translation (original application) 104 | - Text Summarization 105 | - Question Answering 106 | - Sentiment Analysis 107 | - Text Generation 108 | - **Vision Transformers (ViT):** 109 | - Apply Transformer architecture directly to sequences of image patches for image classification. 110 | 111 | ## Implementing a Simple Transformer with PyTorch 112 | 113 | This section will provide code examples for building the core components of a Transformer, such as Scaled Dot-Product Attention, Multi-Head Attention, Positional Encoding, and a basic Encoder-Decoder structure using PyTorch. 114 | 115 | ```python 116 | import torch 117 | import torch.nn as nn 118 | import math 119 | 120 | # Example: Scaled Dot-Product Attention 121 | class ScaledDotProductAttention(nn.Module): 122 | def forward(self, query, key, value, mask=None): 123 | dk = query.size(-1) 124 | scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(dk) 125 | if mask is not None: 126 | scores = scores.masked_fill(mask == 0, -1e9) 127 | attention = torch.softmax(scores, dim=-1) 128 | return torch.matmul(attention, value) 129 | 130 | # Further components like MultiHeadAttention, PositionalEncoding, EncoderLayer, DecoderLayer will be shown. 131 | ``` 132 | 133 | ## Pre-trained Transformer Models (BERT, GPT) 134 | 135 | - **BERT (Bidirectional Encoder Representations from Transformers):** 136 | - Developed by Google. 137 | - Designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. 138 | - Used for tasks like question answering, language inference. 139 | - **GPT (Generative Pre-trained Transformer):** 140 | - Developed by OpenAI. 141 | - Uses a decoder-only transformer architecture. 142 | - Excels at text generation tasks. 143 | - **Hugging Face Transformers Library:** 144 | - Provides thousands of pre-trained models for a wide range of tasks in NLP, vision, and audio. 145 | - Simplifies downloading and using state-of-the-art models. 
146 | 147 | ```python 148 | # Example using Hugging Face Transformers 149 | # from transformers import BertTokenizer, BertModel 150 | # tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 151 | # model = BertModel.from_pretrained('bert-base-uncased') 152 | # inputs = tokenizer("Hello, my dog is cute", return_tensors="pt") 153 | # outputs = model(**inputs) 154 | # last_hidden_states = outputs.last_hidden_state 155 | ``` 156 | 157 | ## Fine-tuning Pre-trained Transformers 158 | 159 | - **Concepts:** Instead of training a large model from scratch, take a pre-trained model and adapt it to a specific downstream task using a smaller, task-specific dataset. 160 | - **Techniques:** 161 | - Add a task-specific layer (e.g., a classification head) on top of the pre-trained model. 162 | - Unfreeze some of the top layers of the pre-trained model and train them with the task-specific layer. 163 | - Or, unfreeze and train the entire model, but with a much smaller learning rate. 164 | 165 | ## Running the Tutorial 166 | 167 | To run the Python script associated with this tutorial: 168 | ```bash 169 | python transformers_and_attention_mechanisms.py 170 | ``` 171 | Alternatively, you can follow along with the Jupyter notebook `transformers_and_attention_mechanisms.ipynb` for an interactive experience. 172 | 173 | ## Prerequisites 174 | - Python 3.7+ 175 | - PyTorch 1.10+ 176 | - (Optionally) Hugging Face Transformers library: `pip install transformers` 177 | 178 | ## Next Steps 179 | Explore building and training a full Transformer model for a specific task, or dive deeper into the mathematics and variations of attention mechanisms. -------------------------------------------------------------------------------- /01_pytorch_basics/pytorch_basics.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | PyTorch Basics 6 | 7 | This script provides an introduction to PyTorch, covering tensors, operations, 8 | and basic computational graphs. 
9 | """ 10 | 11 | import torch 12 | import numpy as np 13 | import matplotlib.pyplot as plt 14 | 15 | # Set random seed for reproducibility 16 | torch.manual_seed(42) 17 | 18 | # Device configuration 19 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 20 | print(f"Using device: {device}") 21 | 22 | # ----------------------------------------------------------------------------- 23 | # Section 1: Introduction to PyTorch 24 | # ----------------------------------------------------------------------------- 25 | 26 | def intro_to_pytorch(): 27 | """Introduce basic PyTorch concepts and features.""" 28 | print("Introduction to PyTorch") 29 | print("-" * 50) 30 | print("PyTorch is an open-source machine learning library for Python.") 31 | print("Key features include:") 32 | print(" - Tensor computation with strong GPU acceleration") 33 | print(" - Dynamic neural networks") 34 | print(" - Automatic differentiation for deep learning") 35 | print(f"PyTorch version: {torch.__version__}") 36 | if torch.cuda.is_available(): 37 | print(f"CUDA version: {torch.version.cuda}") 38 | 39 | # ----------------------------------------------------------------------------- 40 | # Section 2: Tensors 41 | # ----------------------------------------------------------------------------- 42 | 43 | def demonstrate_tensors(): 44 | """Demonstrate tensor creation and properties.""" 45 | print("\nTensors in PyTorch") 46 | print("-" * 50) 47 | 48 | # Creating tensors 49 | tensor_1d = torch.tensor([1, 2, 3, 4, 5]) 50 | tensor_2d = torch.tensor([[1, 2, 3], [4, 5, 6]]) 51 | print("1D Tensor:", tensor_1d) 52 | print("2D Tensor:\n", tensor_2d) 53 | 54 | # Tensor properties 55 | print("\nTensor Properties:") 56 | print(f"Shape of 1D tensor: {tensor_1d.shape}") 57 | print(f"Shape of 2D tensor: {tensor_2d.shape}") 58 | print(f"Data type of 1D tensor: {tensor_1d.dtype}") 59 | print(f"Device of 1D tensor: {tensor_1d.device}") 60 | 61 | # Different initialization methods 62 | zeros_tensor = torch.zeros(3, 3) 63 | ones_tensor = torch.ones(2, 4) 64 | random_tensor = torch.randn(2, 3) 65 | print("\nInitialization Methods:") 66 | print("Zeros Tensor:\n", zeros_tensor) 67 | print("Ones Tensor:\n", ones_tensor) 68 | print("Random Tensor:\n", random_tensor) 69 | 70 | # Converting data types 71 | float_tensor = tensor_1d.float() 72 | int_tensor = tensor_1d.int() 73 | print("\nType Conversion:") 74 | print(f"Original dtype: {tensor_1d.dtype}") 75 | print(f"Float tensor dtype: {float_tensor.dtype}") 76 | print(f"Int tensor dtype: {int_tensor.dtype}") 77 | 78 | # ----------------------------------------------------------------------------- 79 | # Section 3: Tensor Operations 80 | # ----------------------------------------------------------------------------- 81 | 82 | def demonstrate_tensor_operations(): 83 | """Demonstrate various tensor operations.""" 84 | print("\nTensor Operations") 85 | print("-" * 50) 86 | 87 | # Create sample tensors 88 | a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float) 89 | b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float) 90 | 91 | # Element-wise operations 92 | add_result = a + b 93 | mul_result = a * b 94 | print("Element-wise Operations:") 95 | print("Addition:\n", add_result) 96 | print("Multiplication:\n", mul_result) 97 | 98 | # Matrix operations 99 | matmul_result = torch.matmul(a, b) 100 | transpose = a.t() 101 | print("\nMatrix Operations:") 102 | print("Matrix Multiplication:\n", matmul_result) 103 | print("Transpose of a:\n", transpose) 104 | 105 | # Reshaping 106 | reshape_result 
= a.view(4, 1) 107 | print("\nReshaping:") 108 | print("Original shape:", a.shape) 109 | print("Reshaped tensor shape:", reshape_result.shape) 110 | print("Reshaped tensor:\n", reshape_result) 111 | 112 | # Indexing 113 | element = a[0, 1] 114 | row = a[0, :] 115 | print("\nIndexing:") 116 | print("Element at [0,1]:", element) 117 | print("First row:", row) 118 | 119 | # Broadcasting 120 | scalar = torch.tensor(2.0) 121 | broadcast_result = a + scalar 122 | print("\nBroadcasting:") 123 | print("Original tensor:\n", a) 124 | print("After adding scalar 2.0:\n", broadcast_result) 125 | 126 | # ----------------------------------------------------------------------------- 127 | # Section 4: NumPy Integration 128 | # ----------------------------------------------------------------------------- 129 | 130 | def demonstrate_numpy_integration(): 131 | """Demonstrate integration between PyTorch tensors and NumPy arrays.""" 132 | print("\nNumPy Integration") 133 | print("-" * 50) 134 | 135 | # Convert NumPy array to tensor 136 | np_array = np.array([[1, 2], [3, 4]]) 137 | tensor_from_np = torch.from_numpy(np_array) 138 | print("NumPy array to Tensor:") 139 | print("NumPy array:\n", np_array) 140 | print("Tensor:\n", tensor_from_np) 141 | 142 | # Convert tensor to NumPy array 143 | tensor = torch.tensor([[5, 6], [7, 8]]) 144 | np_from_tensor = tensor.numpy() 145 | print("\nTensor to NumPy array:") 146 | print("Tensor:\n", tensor) 147 | print("NumPy array:\n", np_from_tensor) 148 | 149 | # Shared memory demonstration 150 | print("\nShared Memory Demonstration:") 151 | np_array[0, 0] = 99 152 | print("Modified NumPy array:\n", np_array) 153 | print("Tensor (shares memory):\n", tensor_from_np) 154 | 155 | # ----------------------------------------------------------------------------- 156 | # Section 5: GPU Acceleration 157 | # ----------------------------------------------------------------------------- 158 | 159 | def demonstrate_gpu_acceleration(): 160 | """Demonstrate GPU usage with PyTorch.""" 161 | print("\nGPU Acceleration") 162 | print("-" * 50) 163 | 164 | if torch.cuda.is_available(): 165 | # Create tensor on CPU 166 | cpu_tensor = torch.randn(1000, 1000) 167 | print("CPU tensor device:", cpu_tensor.device) 168 | 169 | # Move tensor to GPU 170 | gpu_tensor = cpu_tensor.to(device) 171 | print("GPU tensor device:", gpu_tensor.device) 172 | 173 | # Perform operation on GPU 174 | start_time = time.time() 175 | result_gpu = torch.matmul(gpu_tensor, gpu_tensor) 176 | gpu_time = time.time() - start_time 177 | 178 | # Perform same operation on CPU 179 | start_time = time.time() 180 | result_cpu = torch.matmul(cpu_tensor, cpu_tensor) 181 | cpu_time = time.time() - start_time 182 | 183 | print(f"Matrix multiplication time on CPU: {cpu_time:.4f} seconds") 184 | print(f"Matrix multiplication time on GPU: {gpu_time:.4f} seconds") 185 | print(f"Speedup: {cpu_time/gpu_time:.2f}x") 186 | else: 187 | print("CUDA is not available. 
GPU demonstration skipped.") 188 | print("To enable GPU acceleration, install CUDA and cuDNN.") 189 | 190 | # ----------------------------------------------------------------------------- 191 | # Section 6: Computational Graphs 192 | # ----------------------------------------------------------------------------- 193 | 194 | def demonstrate_computational_graphs(): 195 | """Demonstrate dynamic computational graphs and autograd.""" 196 | print("\nComputational Graphs") 197 | print("-" * 50) 198 | 199 | # Create tensors with gradient tracking 200 | x = torch.tensor(2.0, requires_grad=True) 201 | y = torch.tensor(3.0, requires_grad=True) 202 | 203 | # Define a simple computation 204 | z = x * y + x**2 205 | print("Forward computation: z = x * y + x^2") 206 | print(f"x = {x.item()}, y = {y.item()}") 207 | print(f"z = {z.item()}") 208 | 209 | # Compute gradients 210 | z.backward() 211 | print("\nGradients:") 212 | print(f"dz/dx = {x.grad.item()} (should be y + 2x = {y.item() + 2*x.item()})") 213 | print(f"dz/dy = {y.grad.item()} (should be x = {x.item()})") 214 | 215 | # Demonstrate a more complex graph 216 | a = torch.tensor(1.0, requires_grad=True) 217 | b = torch.tensor(2.0, requires_grad=True) 218 | c = a + b 219 | d = c * a 220 | e = d + b**2 221 | print("\nMore complex graph: e = (a + b) * a + b^2") 222 | e.backward() 223 | print("Gradients:") 224 | print(f"de/da = {a.grad.item()}") 225 | print(f"de/db = {b.grad.item()}") 226 | 227 | # ----------------------------------------------------------------------------- 228 | # Main function to run all sections 229 | # ----------------------------------------------------------------------------- 230 | 231 | import time 232 | 233 | def main(): 234 | """Main function to run all PyTorch basics tutorial sections.""" 235 | print("=" * 80) 236 | print("PyTorch Basics Tutorial") 237 | print("=" * 80) 238 | 239 | # Section 1: Introduction 240 | intro_to_pytorch() 241 | 242 | # Section 2: Tensors 243 | demonstrate_tensors() 244 | 245 | # Section 3: Tensor Operations 246 | demonstrate_tensor_operations() 247 | 248 | # Section 4: NumPy Integration 249 | demonstrate_numpy_integration() 250 | 251 | # Section 5: GPU Acceleration 252 | demonstrate_gpu_acceleration() 253 | 254 | # Section 6: Computational Graphs 255 | demonstrate_computational_graphs() 256 | 257 | print("\nTutorial complete!") 258 | 259 | if __name__ == '__main__': 260 | main() -------------------------------------------------------------------------------- /12_distributed_training/README.md: -------------------------------------------------------------------------------- 1 | # Distributed Training 2 | 3 | This tutorial covers distributed training techniques in PyTorch, enabling you to scale your models across multiple GPUs and machines for faster training and larger model capacity. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Distributed Training](#introduction-to-distributed-training) 7 | 2. [Data Parallel (DP)](#data-parallel-dp) 8 | 3. [Distributed Data Parallel (DDP)](#distributed-data-parallel-ddp) 9 | 4. [Model Parallel](#model-parallel) 10 | 5. [Pipeline Parallelism](#pipeline-parallelism) 11 | 6. [Fully Sharded Data Parallel (FSDP)](#fully-sharded-data-parallel-fsdp) 12 | 13 | ## Introduction to Distributed Training 14 | 15 | - Why distributed training? 
16 | - Types of parallelism 17 | - Communication backends 18 | - Hardware requirements 19 | 20 | ## Data Parallel (DP) 21 | 22 | - Single-machine multi-GPU training 23 | - Automatic gradient averaging 24 | - Limitations and performance considerations 25 | - When to use DP vs DDP 26 | 27 | ## Distributed Data Parallel (DDP) 28 | 29 | - Multi-GPU and multi-node training 30 | - Process groups and initialization 31 | - Gradient synchronization 32 | - Best practices for DDP 33 | 34 | ## Model Parallel 35 | 36 | - Splitting models across devices 37 | - Forward and backward pass coordination 38 | - Memory management 39 | - Use cases for very large models 40 | 41 | ## Pipeline Parallelism 42 | 43 | - Micro-batch processing 44 | - Pipeline stages 45 | - Bubble overhead optimization 46 | - Combining with data parallelism 47 | 48 | ## Fully Sharded Data Parallel (FSDP) 49 | 50 | - Sharding model parameters, gradients, and optimizer states 51 | - Memory efficiency for large models 52 | - Configuration options 53 | - Performance tuning 54 | 55 | ## Running the Tutorial 56 | 57 | To run this tutorial: 58 | 59 | ```bash 60 | # Single GPU example 61 | python distributed_training.py 62 | 63 | # Multi-GPU DDP example 64 | torchrun --nproc_per_node=2 distributed_training.py --distributed 65 | 66 | # Multi-node example 67 | torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=29500 distributed_training.py --distributed 68 | ``` 69 | 70 | Alternatively, you can follow along with the Jupyter notebook `distributed_training.ipynb` for an interactive experience. 71 | 72 | ## Prerequisites 73 | 74 | - Python 3.7+ 75 | - PyTorch 1.10+ 76 | - Multiple GPUs (for multi-GPU examples) 77 | - NCCL backend (for optimal performance) 78 | 79 | ## Related Tutorials 80 | 81 | 1. [Training Neural Networks](../04_training_neural_networks/README.md) 82 | 2. [PyTorch Lightning](../11_pytorch_lightning/README.md) 83 | 84 | ## Introduction to Distributed Training 85 | 86 | Distributed training is essential for modern deep learning, allowing you to: 87 | - **Reduce training time** by leveraging multiple GPUs/machines 88 | - **Train larger models** that don't fit on a single GPU 89 | - **Process larger batches** for better gradient estimates 90 | 91 | ### Types of Parallelism 92 | 93 | 1. **Data Parallelism**: Split data across devices, replicate model 94 | 2. **Model Parallelism**: Split model across devices 95 | 3. **Pipeline Parallelism**: Split model into stages processed sequentially 96 | 4. 
**Hybrid Approaches**: Combine multiple parallelism strategies 97 | 98 | ### Communication Backends 99 | 100 | PyTorch supports multiple backends for inter-process communication: 101 | - **NCCL** (recommended for GPUs): Optimized for NVIDIA GPUs 102 | - **Gloo**: CPU and GPU support, good for development 103 | - **MPI**: Message Passing Interface, requires separate installation 104 | 105 | ## Data Parallel (DP) 106 | 107 | DataParallel is the simplest way to use multiple GPUs on a single machine: 108 | 109 | ```python 110 | import torch 111 | import torch.nn as nn 112 | 113 | # Create model 114 | model = nn.Sequential( 115 | nn.Linear(10, 100), 116 | nn.ReLU(), 117 | nn.Linear(100, 10) 118 | ) 119 | 120 | # Wrap with DataParallel 121 | if torch.cuda.device_count() > 1: 122 | model = nn.DataParallel(model) 123 | model = model.to('cuda') 124 | 125 | # Forward pass automatically uses all GPUs 126 | input = torch.randn(32, 10).to('cuda') 127 | output = model(input) 128 | ``` 129 | 130 | ### Limitations of DP 131 | 132 | - Python GIL bottleneck 133 | - Imbalanced GPU memory usage 134 | - Lower performance compared to DDP 135 | - Single-machine only 136 | 137 | ## Distributed Data Parallel (DDP) 138 | 139 | DDP is the recommended approach for distributed training: 140 | 141 | ```python 142 | import torch 143 | import torch.distributed as dist 144 | import torch.multiprocessing as mp 145 | from torch.nn.parallel import DistributedDataParallel as DDP 146 | 147 | def setup(rank, world_size): 148 | """Initialize the distributed environment.""" 149 | dist.init_process_group("nccl", rank=rank, world_size=world_size) 150 | 151 | def cleanup(): 152 | """Clean up the distributed environment.""" 153 | dist.destroy_process_group() 154 | 155 | def train(rank, world_size): 156 | setup(rank, world_size) 157 | 158 | # Create model and move to GPU 159 | model = nn.Sequential( 160 | nn.Linear(10, 100), 161 | nn.ReLU(), 162 | nn.Linear(100, 10) 163 | ).to(rank) 164 | 165 | # Wrap with DDP 166 | ddp_model = DDP(model, device_ids=[rank]) 167 | 168 | # Create data loader with DistributedSampler 169 | dataset = YourDataset() 170 | sampler = torch.utils.data.distributed.DistributedSampler( 171 | dataset, num_replicas=world_size, rank=rank 172 | ) 173 | dataloader = torch.utils.data.DataLoader( 174 | dataset, batch_size=32, sampler=sampler 175 | ) 176 | 177 | # Training loop 178 | optimizer = torch.optim.Adam(ddp_model.parameters()) 179 | for epoch in range(num_epochs): 180 | sampler.set_epoch(epoch) # Ensure different shuffling per epoch 181 | for data, target in dataloader: 182 | optimizer.zero_grad() 183 | output = ddp_model(data.to(rank)) 184 | loss = loss_fn(output, target.to(rank)) 185 | loss.backward() 186 | optimizer.step() 187 | 188 | cleanup() 189 | 190 | # Launch distributed training 191 | if __name__ == "__main__": 192 | world_size = torch.cuda.device_count() 193 | mp.spawn(train, args=(world_size,), nprocs=world_size, join=True) 194 | ``` 195 | 196 | ### DDP Best Practices 197 | 198 | 1. **Use DistributedSampler** to ensure each process gets different data 199 | 2. **Set random seeds** per process for reproducibility 200 | 3. **Synchronize metrics** across processes when needed 201 | 4. **Save checkpoints** from only one process (usually rank 0) 202 | 5. 
**Use gradient accumulation** for large effective batch sizes
203 | 
204 | ## Model Parallel
205 | 
206 | For models too large to fit on a single GPU:
207 | 
208 | ```python
209 | class ModelParallelNet(nn.Module):
210 |     def __init__(self):
211 |         super().__init__()
212 |         # Place different parts on different GPUs
213 |         self.layer1 = nn.Linear(10, 100).to('cuda:0')
214 |         self.layer2 = nn.Linear(100, 100).to('cuda:1')
215 |         self.layer3 = nn.Linear(100, 10).to('cuda:1')
216 | 
217 |     def forward(self, x):
218 |         x = self.layer1(x.to('cuda:0'))
219 |         x = self.layer2(x.to('cuda:1'))
220 |         x = self.layer3(x)
221 |         return x
222 | ```
223 | 
224 | ### Challenges with Model Parallel
225 | 
226 | - Device idle time during forward/backward
227 | - Complex implementation for arbitrary models
228 | - Communication overhead between devices
229 | 
230 | ## Pipeline Parallelism
231 | 
232 | Pipeline parallelism addresses idle time by processing micro-batches:
233 | 
234 | ```python
235 | import torch.distributed.rpc as rpc
236 | from torch.distributed.pipeline.sync import Pipe
237 | 
238 | # Pipe relies on the RPC framework (set MASTER_ADDR/MASTER_PORT, then initialize once)
239 | rpc.init_rpc("worker", rank=0, world_size=1)
240 | 
241 | # Define a sequential model with each stage already placed on its device
242 | stage1 = nn.Sequential(nn.Linear(10, 100), nn.ReLU()).to('cuda:0')
243 | stage2 = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, 10)).to('cuda:1')
244 | model = nn.Sequential(stage1, stage2)
245 | 
246 | # Create pipeline; each mini-batch is split into micro-batches (chunks)
247 | model = Pipe(model, chunks=8)
248 | 
249 | # Forward pass returns an RRef; fetch the result tensor
250 | output = model(input).local_value()
251 | ```
252 | 
253 | ### Pipeline Parallelism Benefits
254 | 
255 | - Better GPU utilization
256 | - Automatic micro-batch scheduling
257 | - Can combine with data parallelism
258 | - Suitable for deep networks
259 | 
260 | ## Fully Sharded Data Parallel (FSDP)
261 | 
262 | FSDP enables training of extremely large models by sharding parameters:
263 | 
264 | ```python
265 | from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
266 | from torch.distributed.fsdp.wrap import wrap
267 | 
268 | class FSDPModel(nn.Module):
269 |     def __init__(self):
270 |         super().__init__()
271 |         self.layer1 = wrap(nn.Linear(10, 100))
272 |         self.layer2 = wrap(nn.Linear(100, 100))
273 |         self.layer3 = wrap(nn.Linear(100, 10))
274 | 
275 |     def forward(self, x):
276 |         x = torch.relu(self.layer1(x))
277 |         x = torch.relu(self.layer2(x))
278 |         return self.layer3(x)
279 | 
280 | # Wrap entire model with FSDP
281 | model = FSDP(FSDPModel())
282 | 
283 | # Training works as normal
284 | optimizer = torch.optim.Adam(model.parameters())
285 | for data, target in dataloader:
286 |     optimizer.zero_grad()
287 |     output = model(data)
288 |     loss = loss_fn(output, target)
289 |     loss.backward()
290 |     optimizer.step()
291 | ```
292 | 
293 | ### FSDP Configuration
294 | 
295 | ```python
296 | from torch.distributed.fsdp import (
297 |     FullyShardedDataParallel as FSDP, CPUOffload,
298 |     MixedPrecision,
299 |     BackwardPrefetch,
300 |     ShardingStrategy,
301 | )
302 | 
303 | # Configure FSDP
304 | fsdp_config = {
305 |     "sharding_strategy": ShardingStrategy.FULL_SHARD,
306 |     "cpu_offload": CPUOffload(offload_params=True),
307 |     "mixed_precision": MixedPrecision(
308 |         param_dtype=torch.float16,
309 |         reduce_dtype=torch.float16,
310 |         buffer_dtype=torch.float16,
311 |     ),
312 |     "backward_prefetch": BackwardPrefetch.BACKWARD_PRE,
313 | }
314 | 
315 | model = FSDP(model, **fsdp_config)
316 | ```
317 | 
318 | ## Performance Optimization Tips
319 | 
320 | 1. **Profile your code** to identify bottlenecks
321 | 2. **Overlap computation and communication** when possible
322 | 3. **Use mixed precision training** for faster computation
323 | 4. 
**Tune batch sizes** for optimal GPU utilization 324 | 5. **Monitor GPU memory** and adjust accordingly 325 | 326 | ## Common Pitfalls and Solutions 327 | 328 | ### Hanging Processes 329 | - Ensure all processes execute the same number of collective operations 330 | - Use proper error handling and cleanup 331 | 332 | ### Gradient Synchronization Issues 333 | - Verify all processes have the same model architecture 334 | - Check for conditional logic that might cause divergence 335 | 336 | ### Memory Imbalance 337 | - Balance model partitioning for model parallel 338 | - Use gradient checkpointing for memory-intensive models 339 | 340 | ## Monitoring and Debugging 341 | 342 | ```python 343 | # Log only from main process 344 | if rank == 0: 345 | print(f"Epoch {epoch}, Loss: {loss.item()}") 346 | 347 | # Synchronize before timing 348 | dist.barrier() 349 | start_time = time.time() 350 | 351 | # Use distributed.all_reduce for metrics 352 | dist.all_reduce(loss, op=dist.ReduceOp.AVG) 353 | ``` 354 | 355 | ## Conclusion 356 | 357 | Distributed training is essential for modern deep learning. Key takeaways: 358 | - Use DDP for most multi-GPU scenarios 359 | - Consider FSDP for very large models 360 | - Combine strategies for optimal performance 361 | - Always profile and monitor your training 362 | 363 | The next tutorials will explore more advanced optimization techniques and deployment strategies. -------------------------------------------------------------------------------- /01_pytorch_basics/README.md: -------------------------------------------------------------------------------- 1 | # PyTorch Basics 2 | 3 | This tutorial covers the fundamental concepts of PyTorch, providing a foundation for deep learning applications. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to PyTorch](#introduction-to-pytorch) 7 | 2. [Tensors](#tensors) 8 | 3. [Tensor Operations](#tensor-operations) 9 | 4. [NumPy Integration](#numpy-integration) 10 | 5. [GPU Acceleration](#gpu-acceleration) 11 | 6. [Computational Graphs](#computational-graphs) 12 | 13 | ## Introduction to PyTorch 14 | 15 | - Overview of PyTorch as a deep learning framework 16 | - Key features and advantages 17 | - Installation and setup instructions 18 | 19 | ## Tensors 20 | 21 | - Creating tensors 22 | - Tensor types and shapes 23 | - Tensor initialization methods 24 | - Converting between data types 25 | 26 | ## Tensor Operations 27 | 28 | - Element-wise operations 29 | - Matrix operations 30 | - Reshaping and indexing 31 | - Broadcasting 32 | 33 | ## NumPy Integration 34 | 35 | - Converting between PyTorch tensors and NumPy arrays 36 | - Shared memory considerations 37 | - Practical examples of integration 38 | 39 | ## GPU Acceleration 40 | 41 | - Checking GPU availability 42 | - Moving tensors to GPU 43 | - Basic operations on GPU 44 | - Performance considerations 45 | 46 | ## Computational Graphs 47 | 48 | - Understanding dynamic computational graphs 49 | - Graph visualization 50 | - Basic autograd operations 51 | 52 | ## Running the Tutorial 53 | 54 | To run this tutorial: 55 | 56 | ```bash 57 | python pytorch_basics.py 58 | ``` 59 | 60 | Alternatively, you can follow along with the Jupyter notebook `pytorch_basics.ipynb` for an interactive experience. 61 | 62 | ## Prerequisites 63 | 64 | - Python 3.7+ 65 | - PyTorch 1.10+ 66 | 67 | ## Related Tutorials 68 | 69 | 1. [Neural Networks Fundamentals](../02_neural_networks_fundamentals/README.md) 70 | 2. 
[Automatic Differentiation](../03_automatic_differentiation/README.md) 71 | 72 | ## Introduction to PyTorch 73 | 74 | PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It provides a flexible and intuitive interface for building and training neural networks. PyTorch is known for its dynamic computational graph, which allows for more flexible model development compared to static graph frameworks. 75 | 76 | Key features of PyTorch include: 77 | - Dynamic computational graph (define-by-run) 78 | - Intuitive Python interface 79 | - Seamless integration with Python data science stack 80 | - GPU acceleration 81 | - Rich ecosystem of tools and libraries 82 | 83 | ## Tensors 84 | 85 | Tensors are the fundamental data structure in PyTorch, similar to NumPy arrays but with additional capabilities like GPU acceleration and automatic differentiation. Tensors can represent scalars, vectors, matrices, and higher-dimensional data. 86 | 87 | ### Creating Tensors 88 | 89 | ```python 90 | import torch 91 | 92 | # Create a tensor from a Python list 93 | x = torch.tensor([1, 2, 3, 4]) 94 | print(x) # tensor([1, 2, 3, 4]) 95 | 96 | # Create a 2D tensor (matrix) 97 | matrix = torch.tensor([[1, 2], [3, 4]]) 98 | print(matrix) 99 | # tensor([[1, 2], 100 | # [3, 4]]) 101 | 102 | # Create tensors with specific data types 103 | float_tensor = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32) 104 | int_tensor = torch.tensor([1, 2, 3], dtype=torch.int64) 105 | 106 | # Create tensors with specific shapes 107 | zeros = torch.zeros(3, 4) # 3x4 tensor of zeros 108 | ones = torch.ones(2, 3) # 2x3 tensor of ones 109 | rand = torch.rand(2, 2) # 2x2 tensor with random values from uniform distribution [0, 1) 110 | randn = torch.randn(2, 2) # 2x2 tensor with random values from standard normal distribution 111 | 112 | # Create a tensor with a specific range 113 | range_tensor = torch.arange(0, 10, step=1) # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) 114 | linspace = torch.linspace(0, 1, steps=5) # tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000]) 115 | 116 | # Create an identity matrix 117 | eye = torch.eye(3) # 3x3 identity matrix 118 | ``` 119 | 120 | ### Tensor Attributes 121 | 122 | ```python 123 | x = torch.randn(3, 4, 5) 124 | 125 | print(x.shape) # torch.Size([3, 4, 5]) 126 | print(x.size()) # torch.Size([3, 4, 5]) 127 | print(x.dim()) # 3 (number of dimensions) 128 | print(x.dtype) # torch.float32 129 | print(x.device) # device(type='cpu') 130 | ``` 131 | 132 | ### Tensor Indexing and Slicing 133 | 134 | ```python 135 | x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) 136 | 137 | # Indexing 138 | print(x[0, 0]) # tensor(1) 139 | print(x[1, 2]) # tensor(6) 140 | 141 | # Slicing 142 | print(x[:, 0]) # First column: tensor([1, 4, 7]) 143 | print(x[1, :]) # Second row: tensor([4, 5, 6]) 144 | print(x[0:2, 1:3]) # Sub-matrix: tensor([[2, 3], [5, 6]]) 145 | 146 | # Advanced indexing 147 | indices = torch.tensor([0, 2]) 148 | print(x[indices]) # tensor([[1, 2, 3], [7, 8, 9]]) 149 | 150 | # Boolean indexing 151 | mask = x > 5 152 | print(mask) 153 | # tensor([[False, False, False], 154 | # [False, False, True], 155 | # [ True, True, True]]) 156 | print(x[mask]) # tensor([6, 7, 8, 9]) 157 | ``` 158 | 159 | ## Tensor Operations 160 | 161 | PyTorch provides a wide range of operations for manipulating tensors. 
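The table of contents above also lists broadcasting, which the subsections below do not demonstrate explicitly, so here is a quick, minimal illustration using standard PyTorch broadcasting rules (analogous to NumPy):

```python
import torch

# Broadcasting: a (3, 1) column tensor and a (2,) row tensor expand to a common (3, 2) shape
col = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1)
row = torch.tensor([10.0, 20.0])            # shape (2,)
print(col + row)
# tensor([[11., 21.],
#         [12., 22.],
#         [13., 23.]])
```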
162 | 
163 | ### Arithmetic Operations
164 | 
165 | ```python
166 | a = torch.tensor([1, 2, 3])
167 | b = torch.tensor([4, 5, 6])
168 | 
169 | # Addition
170 | print(a + b)            # tensor([5, 7, 9])
171 | print(torch.add(a, b))  # tensor([5, 7, 9])
172 | 
173 | # Subtraction
174 | print(a - b)            # tensor([-3, -3, -3])
175 | print(torch.sub(a, b))  # tensor([-3, -3, -3])
176 | 
177 | # Multiplication (element-wise)
178 | print(a * b)            # tensor([4, 10, 18])
179 | print(torch.mul(a, b))  # tensor([4, 10, 18])
180 | 
181 | # Division (element-wise)
182 | print(a / b)            # tensor([0.2500, 0.4000, 0.5000])
183 | print(torch.div(a, b))  # tensor([0.2500, 0.4000, 0.5000])
184 | 
185 | # In-place operations (modifies the tensor)
186 | a.add_(b)  # a becomes tensor([5, 7, 9])
187 | ```
188 | 
189 | ### Matrix Operations
190 | 
191 | ```python
192 | a = torch.tensor([[1, 2], [3, 4]])
193 | b = torch.tensor([[5, 6], [7, 8]])
194 | 
195 | # Matrix multiplication
196 | print(torch.matmul(a, b))
197 | # tensor([[19, 22],
198 | #         [43, 50]])
199 | 
200 | print(a @ b)  # @ operator for matrix multiplication
201 | # tensor([[19, 22],
202 | #         [43, 50]])
203 | 
204 | # Element-wise multiplication
205 | print(a * b)
206 | # tensor([[ 5, 12],
207 | #         [21, 32]])
208 | 
209 | # Transpose
210 | print(a.t())
211 | # tensor([[1, 3],
212 | #         [2, 4]])
213 | 
214 | # Determinant (det requires a floating-point tensor)
215 | print(torch.det(a.float()))  # tensor(-2.)
216 | 
217 | # Inverse (also requires a floating-point tensor)
218 | print(torch.inverse(a.float()))
219 | # tensor([[-2.0000,  1.0000],
220 | #         [ 1.5000, -0.5000]])
221 | ```
222 | 
223 | ### Reduction Operations
224 | 
225 | ```python
226 | x = torch.tensor([[1, 2, 3], [4, 5, 6]])
227 | 
228 | # Sum
229 | print(torch.sum(x))  # tensor(21)
230 | print(x.sum())       # tensor(21)
231 | print(x.sum(dim=0))  # Sum over dim 0 (down each column): tensor([5, 7, 9])
232 | print(x.sum(dim=1))  # Sum over dim 1 (across each row): tensor([6, 15])
233 | 
234 | # Mean
235 | print(torch.mean(x.float()))  # tensor(3.5000)
236 | print(x.float().mean())       # tensor(3.5000)
237 | 
238 | # Max and Min
239 | print(torch.max(x))  # tensor(6)
240 | print(x.max())       # tensor(6)
241 | print(x.max(dim=0))  # Max over dim 0: (values=tensor([4, 5, 6]), indices=tensor([1, 1, 1]))
242 | print(x.min())       # tensor(1)
243 | 
244 | # Product
245 | print(torch.prod(x))  # tensor(720)
246 | ```
247 | 
248 | ### Reshaping Operations
249 | 
250 | ```python
251 | x = torch.tensor([[1, 2, 3], [4, 5, 6]])
252 | 
253 | # Reshape
254 | print(x.reshape(3, 2))
255 | # tensor([[1, 2],
256 | #         [3, 4],
257 | #         [5, 6]])
258 | 
259 | # View (shares the same data with the original tensor)
260 | print(x.view(6, 1))
261 | # tensor([[1],
262 | #         [2],
263 | #         [3],
264 | #         [4],
265 | #         [5],
266 | #         [6]])
267 | 
268 | # Flatten
269 | print(x.flatten())  # tensor([1, 2, 3, 4, 5, 6])
270 | 
271 | # Permute dimensions
272 | y = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # Shape: (2, 2, 2)
273 | print(y.permute(2, 0, 1))  # Reorders dimensions; every dim is 2 here, so the shape stays (2, 2, 2) but the layout changes
274 | 
275 | # Squeeze and Unsqueeze
276 | z = torch.tensor([[[1], [2]]])  # Shape: (1, 2, 1)
277 | print(z.squeeze())   # Remove dimensions of size 1: tensor([1, 2])
278 | print(z.squeeze(0))  # Remove dimension 0 if it's size 1: tensor([[1], [2]])
279 | print(torch.unsqueeze(x, 0))  # Add dimension at position 0: shape becomes (1, 2, 3)
280 | ```
281 | 
282 | ## NumPy Integration
283 | 
284 | PyTorch provides seamless integration with NumPy, allowing you to convert between PyTorch tensors and NumPy arrays.
285 | 286 | ```python 287 | import numpy as np 288 | 289 | # Convert NumPy array to PyTorch tensor 290 | np_array = np.array([1, 2, 3]) 291 | tensor = torch.from_numpy(np_array) 292 | print(tensor) # tensor([1, 2, 3]) 293 | 294 | # Convert PyTorch tensor to NumPy array 295 | tensor = torch.tensor([4, 5, 6]) 296 | np_array = tensor.numpy() 297 | print(np_array) # array([4, 5, 6]) 298 | 299 | # Note: If the tensor is on CPU, the tensor and the NumPy array share the same memory 300 | # Changes to one will affect the other 301 | np_array = np.array([1, 2, 3]) 302 | tensor = torch.from_numpy(np_array) 303 | np_array[0] = 5 304 | print(tensor) # tensor([5, 2, 3]) 305 | 306 | # This doesn't work for tensors on GPU 307 | ``` 308 | 309 | ## GPU Acceleration 310 | 311 | One of the key features of PyTorch is its ability to leverage GPU acceleration for faster computations. 312 | 313 | ```python 314 | # Check if CUDA (NVIDIA GPU) is available 315 | print(torch.cuda.is_available()) # True if CUDA is available 316 | 317 | # Create a tensor on GPU 318 | if torch.cuda.is_available(): 319 | device = torch.device("cuda") 320 | x = torch.tensor([1, 2, 3], device=device) 321 | # or 322 | y = torch.tensor([4, 5, 6]).to(device) 323 | 324 | # Move tensor back to CPU 325 | z = y.cpu() 326 | else: 327 | device = torch.device("cpu") 328 | x = torch.tensor([1, 2, 3]) # Default is CPU 329 | 330 | # Check which device a tensor is on 331 | print(x.device) # device(type='cuda') or device(type='cpu') 332 | 333 | # Perform operations on GPU 334 | if torch.cuda.is_available(): 335 | a = torch.tensor([1, 2, 3], device=device) 336 | b = torch.tensor([4, 5, 6], device=device) 337 | c = a + b # Operation happens on GPU 338 | print(c) # tensor([5, 7, 9], device='cuda:0') 339 | ``` 340 | 341 | ## Computational Graphs 342 | 343 | PyTorch uses a dynamic computational graph, which means the graph is built on-the-fly as operations are executed. This is different from static graph frameworks where the graph is defined before execution. 344 | 345 | ```python 346 | # Create tensors with requires_grad=True to track operations 347 | x = torch.tensor(2.0, requires_grad=True) 348 | y = torch.tensor(3.0, requires_grad=True) 349 | 350 | # Build a computational graph 351 | z = x**2 + y**3 352 | 353 | # Compute gradients 354 | z.backward() 355 | 356 | # Access gradients 357 | print(x.grad) # tensor(4.) (dz/dx = 2*x = 2*2 = 4) 358 | print(y.grad) # tensor(27.) (dz/dy = 3*y^2 = 3*3^2 = 27) 359 | 360 | # Detach a tensor from the graph 361 | a = x.detach() # Creates a new tensor that shares data but doesn't require gradients 362 | ``` 363 | 364 | ### Gradient Accumulation 365 | 366 | By default, PyTorch accumulates gradients when `backward()` is called multiple times. 367 | 368 | ```python 369 | # Reset gradients 370 | x.grad.zero_() 371 | y.grad.zero_() 372 | 373 | # Compute gradients multiple times 374 | z = x**2 + y**3 375 | z.backward() 376 | print(x.grad) # tensor(4.) 377 | 378 | z = x**2 + y**3 379 | z.backward() 380 | print(x.grad) # tensor(8.) (gradients are accumulated) 381 | 382 | # To avoid accumulation, reset gradients before each backward pass 383 | x.grad.zero_() 384 | y.grad.zero_() 385 | ``` 386 | 387 | ## Conclusion 388 | 389 | This tutorial covered the basics of PyTorch, including tensors, operations, NumPy integration, GPU acceleration, and computational graphs. These concepts form the foundation for building and training neural networks with PyTorch. 
390 | 391 | In the next tutorial, we'll explore automatic differentiation and optimization in more detail. -------------------------------------------------------------------------------- /07_recurrent_neural_networks/README.md: -------------------------------------------------------------------------------- 1 | # Recurrent Neural Networks (RNNs) in PyTorch: A Comprehensive Guide 2 | 3 | This tutorial provides an in-depth guide to understanding, implementing, and applying Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) using PyTorch. These models are fundamental for processing sequential data such as text, time series, and audio. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Recurrent Neural Networks](#introduction-to-recurrent-neural-networks) 7 | - What are RNNs and Why Sequential Data? 8 | - The Concept of a Hidden State (Memory) 9 | - Basic RNN Cell Structure and Unrolling 10 | - Challenges: Vanishing and Exploding Gradients 11 | 2. [Core RNN Layer Implementations in PyTorch](#core-rnn-layer-implementations-in-pytorch) 12 | - **`nn.RNN`**: The basic Elman RNN. 13 | - Key Parameters: `input_size`, `hidden_size`, `num_layers`, `batch_first`, `bidirectional`. 14 | - Input and Output Shapes. 15 | - **`nn.LSTM` (Long Short-Term Memory)** 16 | - Addressing Vanishing Gradients with Gates (Forget, Input, Output Gates, Cell State). 17 | - Key Parameters and Shapes. 18 | - **`nn.GRU` (Gated Recurrent Unit)** 19 | - Simplified Gating Mechanism (Update, Reset Gates). 20 | - Key Parameters and Shapes. 21 | - Multi-layer (Stacked) RNNs 22 | - Bidirectional RNNs 23 | 3. [Sequence Modeling with RNNs](#sequence-modeling-with-rnns) 24 | - Many-to-One, One-to-Many, Many-to-Many Architectures (Conceptual) 25 | - **Sequence Classification (Many-to-One):** e.g., Sentiment Analysis. 26 | - Using the final hidden state or pooling outputs for classification. 27 | - Handling Variable-Length Sequences: Padding, Packing (`torch.nn.utils.rnn.pack_padded_sequence`, `pad_packed_sequence`). 28 | 4. [Application: Text Generation (Character-level RNN)](#application-text-generation-character-level-rnn) 29 | - Representing Text Data (Character Encoding). 30 | - Preparing Input-Target Sequences for Language Modeling. 31 | - Building a Character-level RNN/LSTM Model. 32 | - Training the Language Model. 33 | - Generating New Text (Sampling Strategies, Temperature). 34 | 5. [Application: Time Series Forecasting](#application-time-series-forecasting) 35 | - Preparing Time Series Data (Windowing/Sliding Windows). 36 | - Univariate vs. Multivariate Time Series. 37 | - Building an RNN/LSTM Model for Forecasting. 38 | - Sequence-to-Sequence vs. Sequence-to-Value Forecasting. 39 | 6. [Advanced RNN Techniques (Conceptual Overview)](#advanced-rnn-techniques-conceptual-overview) 40 | - **Attention Mechanisms:** Allowing the model to focus on relevant parts of the input sequence. 41 | - **Teacher Forcing:** Using ground truth outputs as inputs during training for faster convergence. 42 | - **Beam Search:** A more advanced decoding strategy for generation tasks. 43 | - **Encoder-Decoder Architecture (Seq2Seq):** For tasks like machine translation. 44 | 7. [Practical Tips for Training RNNs](#practical-tips-for-training-rnns) 45 | - Gradient Clipping to prevent exploding gradients. 46 | - Proper Initialization. 47 | - Choosing between RNN, LSTM, GRU. 48 | - Regularization (Dropout on non-recurrent connections). 
49 | 50 | ## Introduction to Recurrent Neural Networks 51 | 52 | - **What are RNNs and Why Sequential Data?** 53 | RNNs are a class of neural networks designed to recognize patterns in sequences of data, such as text, speech, time series, or genomes. Unlike feedforward networks, RNNs have loops, allowing information to persist from one step of the sequence to the next. 54 | - **The Concept of a Hidden State (Memory):** 55 | The core idea of an RNN is its hidden state, which acts as a form of memory. The hidden state at timestep `t` captures information from all previous timesteps up to `t-1`. This hidden state is updated at each step based on the current input and the previous hidden state. 56 | `h_t = f(W_hh * h_{t-1} + W_xh * x_t + b_h)` 57 | `output_t = g(W_hy * h_t + b_y)` 58 | - **Basic RNN Cell Structure and Unrolling:** An RNN can be thought of as multiple copies of the same network, each passing a message to a successor. Unrolling the RNN visualizes this chain-like structure. 59 | - **Challenges: Vanishing and Exploding Gradients:** Standard RNNs struggle to learn long-range dependencies due to the vanishing gradient problem (gradients shrink exponentially as they propagate back through time) or the exploding gradient problem (gradients grow exponentially). 60 | 61 | ## Core RNN Layer Implementations in PyTorch 62 | 63 | PyTorch provides optimized implementations for common recurrent layers. 64 | 65 | - **`nn.RNN`**: The basic Elman RNN. 66 | - **Key Parameters:** 67 | - `input_size`: The number of expected features in the input `x`. 68 | - `hidden_size`: The number of features in the hidden state `h`. 69 | - `num_layers`: Number of recurrent layers. Stacking RNNs can increase model capacity. 70 | - `nonlinearity`: `tanh` or `relu`. Default: `tanh`. 71 | - `batch_first (bool)`: If `True`, input and output tensors are provided as `(batch, seq, feature)` instead of `(seq, batch, feature)`. Default: `False`. 72 | - `dropout (float)`: If non-zero, introduces a Dropout layer on the outputs of each RNN layer except the last layer. Default: 0. 73 | - `bidirectional (bool)`: If `True`, becomes a bidirectional RNN. Default: `False`. 74 | - **Input Shapes (if `batch_first=False`):** 75 | - `input`: `(seq_len, batch_size, input_size)` 76 | - `h_0` (initial hidden state): `(num_layers * num_directions, batch_size, hidden_size)` 77 | - **Output Shapes (if `batch_first=False`):** 78 | - `output`: `(seq_len, batch_size, num_directions * hidden_size)` (all hidden states from the last layer) 79 | - `h_n` (final hidden state): `(num_layers * num_directions, batch_size, hidden_size)` 80 | ```python 81 | import torch 82 | import torch.nn as nn 83 | 84 | # Example nn.RNN 85 | rnn = nn.RNN(input_size=10, hidden_size=20, num_layers=2, batch_first=True) 86 | # input_tensor shape: (batch_size=5, seq_len=3, input_size=10) 87 | # input_tensor = torch.randn(5, 3, 10) 88 | # h0 shape: (num_layers*num_directions=2*1, batch_size=5, hidden_size=20) 89 | # h0 = torch.randn(2, 5, 20) 90 | # output, hn = rnn(input_tensor, h0) 91 | # print(f"RNN Output shape: {output.shape}") # (5, 3, 20) 92 | # print(f"RNN Hidden state shape: {hn.shape}") # (2, 5, 20) 93 | ``` 94 | 95 | - **`nn.LSTM` (Long Short-Term Memory)** 96 | LSTMs use a more complex cell structure with gates (input, forget, output) and a cell state (`c_t`) to better control information flow and capture long-range dependencies, mitigating vanishing gradients. 97 | - **Gates:** Sigmoid layers that control what information to keep or discard. 
98 | - **Cell State (`c_t`):** A separate memory stream that information can be added to or removed from, regulated by gates. 99 | - **Input/Output Shapes:** Similar to `nn.RNN`, but `h_0` and `h_n` are tuples `(hidden_state, cell_state)`. Each state has shape `(num_layers * num_directions, batch, hidden_size)`. 100 | ```python 101 | # lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, batch_first=True) 102 | # input_lstm = torch.randn(5, 3, 10) 103 | # h0_lstm = torch.randn(2, 5, 20) # Initial hidden state 104 | # c0_lstm = torch.randn(2, 5, 20) # Initial cell state 105 | # output_lstm, (hn_lstm, cn_lstm) = lstm(input_lstm, (h0_lstm, c0_lstm)) 106 | # print(f"LSTM Output shape: {output_lstm.shape}") 107 | # print(f"LSTM Hidden state shape: {hn_lstm.shape}") 108 | # print(f"LSTM Cell state shape: {cn_lstm.shape}") 109 | ``` 110 | 111 | - **`nn.GRU` (Gated Recurrent Unit)** 112 | GRUs are a simpler alternative to LSTMs, combining the cell state and hidden state. They use update and reset gates. 113 | - **Input/Output Shapes:** Same as `nn.RNN`. 114 | ```python 115 | # gru = nn.GRU(input_size=10, hidden_size=20, num_layers=2, batch_first=True) 116 | # input_gru = torch.randn(5, 3, 10) 117 | # h0_gru = torch.randn(2, 5, 20) 118 | # output_gru, hn_gru = gru(input_gru, h0_gru) 119 | # print(f"GRU Output shape: {output_gru.shape}") 120 | # print(f"GRU Hidden state shape: {hn_gru.shape}") 121 | ``` 122 | 123 | - **Multi-layer (Stacked) RNNs:** Set `num_layers > 1`. The output of one layer becomes the input to the next. Dropout can be applied between layers. 124 | - **Bidirectional RNNs:** Set `bidirectional=True`. Processes the sequence in both forward and backward directions. The outputs are typically concatenated. Useful when context from both past and future is important. 125 | 126 | ## Sequence Modeling with RNNs 127 | 128 | - **Architectures:** RNNs can be used for various sequence tasks: 129 | - **Many-to-One:** Input sequence, single output (e.g., sentiment classification of a sentence). 130 | - **One-to-Many:** Single input, output sequence (e.g., image captioning). 131 | - **Many-to-Many (Synchronized):** Input and output sequences have same length (e.g., part-of-speech tagging). 132 | - **Many-to-Many (Delayed/Encoder-Decoder):** Input and output sequences can have different lengths (e.g., machine translation). 133 | - **Handling Variable-Length Sequences:** Real-world sequences often have different lengths. Techniques: 134 | - **Padding:** Pad shorter sequences to the length of the longest sequence in a batch using a special padding token. 135 | - **Packing (`torch.nn.utils.rnn.pack_padded_sequence`):** Before feeding padded sequences to an RNN, pack them to avoid computation on padding tokens. Use `torch.nn.utils.rnn.pad_packed_sequence` to unpack the output. 136 | 137 | ## Application: Text Generation (Character-level RNN) 138 | 139 | - **Representing Text Data:** Convert characters to numerical indices (character encoding). Create a vocabulary of all unique characters. 140 | - **Preparing Sequences:** For a sequence `s`, the input at timestep `t` is `s[t]` and the target is `s[t+1]`. The model learns to predict the next character. 141 | - **Training:** Use Cross-Entropy Loss to compare predicted character probabilities with the actual next character. 142 | - **Generating New Text:** Start with a seed character/sequence. Feed it to the model to get probabilities for the next character. Sample from this distribution (e.g., using `torch.multinomial` or `argmax`). 
Append the sampled character to the sequence and repeat. 143 | - **Temperature:** A hyperparameter to control the randomness of sampling. Higher temperature -> more random; lower temperature -> more deterministic. 144 | 145 | ## Application: Time Series Forecasting 146 | 147 | - **Preparing Data (Windowing):** Create input-output pairs by sliding a window over the time series. Input: `(x_t, x_{t+1}, ..., x_{t+N-1})`. Target: `x_{t+N}` (for one-step ahead) or `(x_{t+N}, ..., x_{t+N+M-1})` (for multi-step ahead). 148 | - **Univariate vs. Multivariate:** Forecasting a single variable vs. multiple interacting variables. 149 | - **Model Output:** Can be a single value (next step) or a sequence (multiple future steps). 150 | 151 | ## Advanced RNN Techniques (Conceptual Overview) 152 | 153 | - **Attention Mechanisms:** For long sequences, allows the model to selectively focus on important parts of the input sequence when producing an output at each timestep. Particularly useful in Seq2Seq models. 154 | - **Teacher Forcing:** During training, instead of feeding the model's own (potentially incorrect) previous prediction as input for the next step, the ground truth from the previous step is used. Helps stabilize training but can lead to exposure bias (discrepancy between training and inference). 155 | - **Beam Search:** A decoding algorithm used in generation tasks (like machine translation or text generation) that explores multiple hypotheses (beams) at each step, rather than just greedily picking the single best option. 156 | - **Encoder-Decoder Architecture (Seq2Seq):** Consists of two RNNs: an encoder that processes the input sequence into a context vector, and a decoder that generates the output sequence from this context vector. Widely used in machine translation and text summarization. 157 | 158 | ## Practical Tips for Training RNNs 159 | 160 | - **Gradient Clipping:** Crucial for RNNs/LSTMs/GRUs to prevent exploding gradients. Use `torch.nn.utils.clip_grad_norm_`. 161 | - **Initialization:** Proper weight initialization (e.g., Xavier, Kaiming, or specific heuristics for RNNs) can be important. 162 | - **Choice of Unit:** LSTMs and GRUs are generally preferred over vanilla RNNs for their ability to handle longer sequences. GRUs are simpler and sometimes faster than LSTMs with comparable performance. 163 | - **Dropout:** Apply dropout between stacked RNN layers (using the `dropout` parameter in `nn.RNN/LSTM/GRU`) or on the non-recurrent connections (e.g., before/after the RNN block or between the RNN output and fully connected layers). 164 | 165 | ## Running the Tutorial 166 | 167 | To run the Python script associated with this tutorial: 168 | ```bash 169 | python recurrent_neural_networks.py 170 | ``` 171 | This will execute demonstrations of RNN, LSTM, GRU layers, a character-level text generation example, and a time series forecasting example. 172 | 173 | ## Prerequisites 174 | - Python 3.7+ 175 | - PyTorch 1.10+ 176 | - NumPy 177 | - Matplotlib (for visualization) 178 | 179 | ## Related Tutorials 180 | 1. [Training Neural Networks](../04_training_neural_networks/README.md) 181 | 2. 
[Transformers and Attention Mechanisms](../08_transformers_and_attention_mechanisms/README.md) (Modern alternative/successor to RNNs for many sequence tasks) -------------------------------------------------------------------------------- /05_data_loading_preprocessing/README.md: -------------------------------------------------------------------------------- 1 | # Data Loading, Preprocessing, and Augmentation in PyTorch 2 | 3 | This tutorial provides a comprehensive guide to efficiently loading, preprocessing, and augmenting data in PyTorch. Effective data handling is a critical step in any machine learning pipeline, ensuring that your model receives data in the correct format and benefits from techniques that can improve generalization. 4 | 5 | ## Table of Contents 6 | 1. [Introduction: The Importance of Data Handling](#introduction-the-importance-of-data-handling) 7 | 2. [PyTorch `Dataset` Class](#pytorch-dataset-class) 8 | - Role and Purpose 9 | - Key Methods: `__init__`, `__len__`, `__getitem__` 10 | - Using Built-in Datasets (e.g., `torchvision.datasets.MNIST`, `CIFAR10`) 11 | 3. [Creating Custom `Dataset`s](#creating-custom-datasets) 12 | - For Image Data (e.g., from a folder of images, from a CSV file with paths) 13 | - For Text Data (e.g., loading text files, tokenization basics) 14 | - For Other Data Types (e.g., CSV, time series) 15 | 4. [PyTorch `DataLoader` Class](#pytorch-dataloader-class) 16 | - Purpose: Batching, Shuffling, Parallel Loading 17 | - Key Parameters: `dataset`, `batch_size`, `shuffle`, `num_workers`, `pin_memory` 18 | - Iterating Through a `DataLoader` 19 | 5. [Data Transformations (`torchvision.transforms`)](#data-transformations-torchvisiontransforms) 20 | - Common Transformations for Images: 21 | - `transforms.ToTensor()`: Converting PIL Images/NumPy arrays to Tensors. 22 | - `transforms.Normalize()`: Normalizing tensor images. 23 | - Resizing, Cropping (`transforms.Resize`, `transforms.CenterCrop`, `transforms.RandomResizedCrop`) 24 | - `transforms.Compose()`: Chaining multiple transformations. 25 | - Creating Custom Transformations 26 | 6. [Data Augmentation](#data-augmentation) 27 | - Why Augment Data? Improving Model Robustness and Generalization. 28 | - Image Augmentation Techniques (using `torchvision.transforms`): 29 | - Random Flips (`transforms.RandomHorizontalFlip`, `transforms.RandomVerticalFlip`) 30 | - Random Rotations (`transforms.RandomRotation`) 31 | - Color Jitter (`transforms.ColorJitter`) 32 | - Random Affine Transformations (`transforms.RandomAffine`) 33 | - Integrating Augmentations into the `Dataset` or `DataLoader` Flow 34 | - Advanced Augmentation Libraries (e.g., Albumentations - conceptual mention) 35 | 7. [Working with Different Data Types](#working-with-different-data-types) 36 | - **Image Data:** Loading, common formats, channel orders. 37 | - **Text Data:** Tokenization, padding, creating vocabulary, embedding lookups (conceptual). 38 | - **Tabular Data:** Loading from CSV/Pandas, feature engineering, encoding categorical features (conceptual). 39 | 8. [Efficient Data Loading Techniques](#efficient-data-loading-techniques) 40 | - `num_workers` in `DataLoader`: Parallelizing data loading. 41 | - `pin_memory=True` in `DataLoader`: Faster CPU-to-GPU data transfer. 42 | - Pre-fetching and Caching Strategies (Conceptual) 43 | - Considerations for Large Datasets that Don't Fit in Memory 44 | 9. 
[Practical Example: Image Classification Dataset](#practical-example-image-classification-dataset) 45 | - Setting up a custom image folder dataset. 46 | - Applying transformations and augmentations. 47 | - Using `DataLoader` for training. 48 | 49 | ## Introduction: The Importance of Data Handling 50 | 51 | Raw data is rarely in a format suitable for direct input into a neural network. Data loading and preprocessing involve several steps: 52 | - **Loading:** Reading data from various sources (files, databases). 53 | - **Preprocessing:** Cleaning, transforming, and structuring data (e.g., resizing images, tokenizing text, normalizing features). 54 | - **Augmentation:** Artificially expanding the dataset by creating modified versions of existing data (e.g., rotating images, paraphrasing text) to improve model generalization and reduce overfitting. 55 | Efficient data handling is crucial for training performance, as data loading can become a bottleneck if not optimized. 56 | 57 | ## PyTorch `Dataset` Class 58 | 59 | - **Role and Purpose:** `torch.utils.data.Dataset` is an abstract class representing a dataset. All datasets in PyTorch that interact with `DataLoader` should inherit from this class. 60 | - **Key Methods:** 61 | - `__init__(self, ...)`: Initializes the dataset (e.g., loads data paths, labels, performs initial setup). 62 | - `__len__(self)`: Returns the total number of samples in the dataset. 63 | - `__getitem__(self, idx)`: Loads and returns a single sample from the dataset at the given index `idx`. This is where transformations are often applied. 64 | - **Using Built-in Datasets:** `torchvision.datasets` provides many common datasets like MNIST, CIFAR10, ImageNet, which are subclasses of `Dataset`. 65 | 66 | ```python 67 | import torchvision 68 | import torchvision.transforms as transforms 69 | 70 | # Example: Using torchvision.datasets.MNIST 71 | mnist_train_raw = torchvision.datasets.MNIST(root='./data', train=True, download=True) 72 | sample_raw, label_raw = mnist_train_raw[0] 73 | print(f"MNIST raw sample type: {type(sample_raw)}, Label: {label_raw}") 74 | 75 | # Applying a transform to convert PIL Image to Tensor 76 | mnist_train_transformed = torchvision.datasets.MNIST( 77 | root='./data', 78 | train=True, 79 | download=True, 80 | transform=transforms.ToTensor() # Converts PIL Image to FloatTensor 81 | ) 82 | sample_tensor, label_tensor = mnist_train_transformed[0] 83 | print(f"MNIST transformed sample type: {type(sample_tensor)}, shape: {sample_tensor.shape}, Label: {label_tensor}") 84 | ``` 85 | 86 | ## Creating Custom `Dataset`s 87 | 88 | For most real-world applications, you'll need to create your own custom `Dataset`. 89 | 90 | - **For Image Data:** Often involves reading image files (e.g., JPEG, PNG) and their corresponding labels. 
91 | ```python 92 | from torch.utils.data import Dataset 93 | from PIL import Image # Pillow library for image manipulation 94 | import os 95 | 96 | class CustomImageDataset(Dataset): 97 | def __init__(self, img_dir, transform=None, target_transform=None): 98 | # Example: img_dir contains subfolders for each class (e.g., img_dir/cat/cat1.jpg) 99 | self.img_labels = [] # List of (image_path, class_index) 100 | self.classes = sorted(entry.name for entry in os.scandir(img_dir) if entry.is_dir()) 101 | self.class_to_idx = {cls_name: i for i, cls_name in enumerate(self.classes)} 102 | 103 | for class_name in self.classes: 104 | class_dir = os.path.join(img_dir, class_name) 105 | for img_name in os.listdir(class_dir): 106 | self.img_labels.append((os.path.join(class_dir, img_name), self.class_to_idx[class_name])) 107 | 108 | self.transform = transform 109 | self.target_transform = target_transform 110 | 111 | def __len__(self): 112 | return len(self.img_labels) 113 | 114 | def __getitem__(self, idx): 115 | img_path, label = self.img_labels[idx] 116 | image = Image.open(img_path).convert("RGB") # Ensure 3 channels 117 | if self.transform: 118 | image = self.transform(image) 119 | if self.target_transform: 120 | label = self.target_transform(label) 121 | return image, label 122 | ``` 123 | - **For Text Data:** Might involve reading lines from files, tokenizing text into numerical representations, and padding sequences. 124 | 125 | ## PyTorch `DataLoader` Class 126 | 127 | - **Purpose:** `torch.utils.data.DataLoader` takes a `Dataset` object and provides an iterable to easily access batches of data. It automates batching, shuffling, and can use multiple worker processes for parallel data loading. 128 | - **Key Parameters:** 129 | - `dataset`: The `Dataset` object from which to load the data. 130 | - `batch_size (int, optional)`: How many samples per batch to load (default: 1). 131 | - `shuffle (bool, optional)`: Set to `True` to have the data reshuffled at every epoch (default: `False`). 132 | - `num_workers (int, optional)`: How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process (default: 0). 133 | - `pin_memory (bool, optional)`: If `True`, the `DataLoader` will copy Tensors into CUDA pinned memory before returning them. Useful for faster CPU to GPU transfers. 134 | 135 | ```python 136 | from torch.utils.data import DataLoader 137 | 138 | # Assuming mnist_train_transformed is an instance of a Dataset 139 | # train_loader = DataLoader(mnist_train_transformed, batch_size=64, shuffle=True, num_workers=2) 140 | 141 | # Iterating through a DataLoader 142 | # for epoch in range(num_epochs): 143 | # for i, (inputs, labels) in enumerate(train_loader): 144 | # # inputs and labels are now batches of data 145 | # # Move to device: inputs, labels = inputs.to(device), labels.to(device) 146 | # # ... training logic ... 147 | # if i % 100 == 0: 148 | # print(f"Epoch {epoch}, Batch {i}, Input shape: {inputs.shape}") 149 | ``` 150 | 151 | ## Data Transformations (`torchvision.transforms`) 152 | 153 | `torchvision.transforms` provides common image transformations. They can be chained together using `transforms.Compose()`. 154 | 155 | - **Common Transformations:** 156 | - `transforms.ToTensor()`: Converts a PIL Image or `numpy.ndarray` (H x W x C) in the range [0, 255] to a `torch.FloatTensor` of shape (C x H x W) in the range [0.0, 1.0]. 157 | - `transforms.Normalize(mean, std)`: Normalizes a tensor image with mean and standard deviation. 
`output[channel] = (input[channel] - mean[channel]) / std[channel]`. 158 | - `transforms.Resize(size)`: Resizes the input PIL Image to the given size. 159 | - `transforms.CenterCrop(size)`: Crops the given PIL Image at the center. 160 | 161 | ```python 162 | # Example of composing transformations 163 | image_transforms = transforms.Compose([ 164 | transforms.Resize(256), 165 | transforms.CenterCrop(224), 166 | transforms.ToTensor(), 167 | transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) # ImageNet stats 168 | ]) 169 | 170 | # my_dataset = CustomImageDataset(..., transform=image_transforms) 171 | ``` 172 | 173 | ## Data Augmentation 174 | 175 | Data augmentation artificially increases the training set size by creating modified copies of its data. This helps the model become more robust to variations and reduces overfitting. 176 | 177 | - **Image Augmentation Techniques:** 178 | - `transforms.RandomHorizontalFlip(p=0.5)` 179 | - `transforms.RandomRotation(degrees)` 180 | - `transforms.ColorJitter(brightness=0, contrast=0, saturation=0, hue=0)` 181 | - `transforms.RandomResizedCrop(size)`: Crops a random part of an image and resizes it. 182 | 183 | ```python 184 | # Example augmentation pipeline for training 185 | train_transforms_augmented = transforms.Compose([ 186 | transforms.RandomResizedCrop(224), 187 | transforms.RandomHorizontalFlip(), 188 | transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1), 189 | transforms.RandomRotation(degrees=15), 190 | transforms.ToTensor(), 191 | transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) 192 | ]) 193 | # For validation/testing, typically only use non-random transformations like Resize, CenterCrop, ToTensor, Normalize. 194 | ``` 195 | 196 | ## Working with Different Data Types 197 | Conceptual overview; detailed implementations depend on the specific task. 198 | - **Image Data:** Use PIL/OpenCV for loading, `torchvision.transforms` for preprocessing/augmentation. Pay attention to channel order (e.g., RGB vs BGR) and normalization. 199 | - **Text Data:** Involves tokenization (splitting text into words/subwords), numericalization (mapping tokens to integers), padding sequences to the same length, and often using pre-trained embeddings or an `nn.Embedding` layer. 200 | - **Tabular Data:** Often loaded using Pandas. Numerical features might need scaling/normalization. Categorical features need encoding (e.g., one-hot encoding, label encoding, or embedding layers). 201 | 202 | ## Efficient Data Loading Techniques 203 | 204 | - **`num_workers > 0`:** Spawns multiple subprocesses to load data in parallel, preventing the main training process from waiting for data I/O. 205 | - **`pin_memory=True`:** If using GPUs, setting this to `True` in `DataLoader` tells PyTorch to put fetched data Tensors in pinned (page-locked) memory. This enables faster data transfer from CPU to GPU memory via Direct Memory Access (DMA). 206 | - **Caching/Pre-fetching:** For very large datasets or slow storage, caching frequently accessed data or pre-fetching next batches can help. 207 | 208 | ## Practical Example: Image Classification Dataset 209 | 210 | This section will be detailed in the accompanying Python script (`data_loading_preprocessing.py`) and Jupyter Notebook, showing an end-to-end example of loading an image dataset from folders, applying transformations, and using `DataLoader`. 
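As a preview, here is a minimal sketch of how the pieces above fit together, reusing the `CustomImageDataset` and `train_transforms_augmented` defined earlier in this tutorial (the folder path is a placeholder you would replace with your own dataset):

```python
from torch.utils.data import DataLoader

# Assumes a folder layout of train_dir/<class_name>/<image>.jpg, as expected by CustomImageDataset
train_dir = "data/train"  # placeholder path
train_dataset = CustomImageDataset(train_dir, transform=train_transforms_augmented)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=2, pin_memory=True)

images, labels = next(iter(train_loader))
print(images.shape)  # torch.Size([32, 3, 224, 224]) after RandomResizedCrop(224)
print(labels.shape)  # torch.Size([32])
```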
211 | 212 | ## Running the Tutorial 213 | 214 | To run the Python script associated with this tutorial: 215 | ```bash 216 | python data_loading_preprocessing.py 217 | ``` 218 | We recommend you manually create a `data_loading_preprocessing.ipynb` notebook and copy the code from the Python script into it for an interactive experience. 219 | 220 | ## Prerequisites 221 | - Python 3.7+ 222 | - PyTorch 1.10+ 223 | - Torchvision (for built-in datasets and transforms) 224 | - Pillow (PIL Fork, usually a dependency of Torchvision: `pip install Pillow`) 225 | - NumPy 226 | 227 | ## Related Tutorials 228 | 1. [PyTorch Basics](../01_pytorch_basics/README.md) 229 | 2. [Training Neural Networks](../04_training_neural_networks/README.md) 230 | 3. [Convolutional Neural Networks](../06_convolutional_neural_networks/README.md) -------------------------------------------------------------------------------- /06_convolutional_neural_networks/README.md: -------------------------------------------------------------------------------- 1 | # Convolutional Neural Networks (CNNs) in PyTorch 2 | 3 | This tutorial provides a comprehensive guide to understanding and implementing Convolutional Neural Networks (CNNs) using PyTorch. CNNs are a class of deep neural networks most commonly applied to analyzing visual imagery, but also effective for other types of data like audio and text. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Convolutional Neural Networks](#introduction-to-convolutional-neural-networks) 7 | - What are CNNs and Why Use Them for Images? 8 | - Key Concepts: Local Receptive Fields, Shared Weights, Pooling 9 | 2. [Core CNN Layers and Components](#core-cnn-layers-and-components) 10 | - **Convolutional Layers (`nn.Conv2d`)** 11 | - Kernels (Filters): Size, Stride, Padding, Dilation 12 | - Input and Output Channels 13 | - Feature Maps 14 | - 2D Convolution Operation Explained 15 | - **Activation Functions (ReLU)** 16 | - Role in CNNs 17 | - **Pooling Layers (`nn.MaxPool2d`, `nn.AvgPool2d`)** 18 | - Purpose: Down-sampling, Dimensionality Reduction, Invariance 19 | - Max Pooling vs. Average Pooling 20 | - Kernel Size and Stride 21 | - **Fully Connected Layers (`nn.Linear`)** 22 | - Role in Classification/Regression after Convolutional Base 23 | - Flattening Feature Maps 24 | - **Batch Normalization (`nn.BatchNorm2d`)** 25 | - Normalizing Activations in CNNs 26 | - **Dropout (`nn.Dropout2d`, `nn.Dropout`)** 27 | - Regularization in CNNs 28 | 3. [Building a Basic CNN Architecture](#building-a-basic-cnn-architecture) 29 | - Stacking Convolutional, Activation, and Pooling Layers 30 | - Adding Fully Connected Layers for Classification 31 | - Example CNN for MNIST or CIFAR-10 32 | 4. [Training CNNs for Image Classification](#training-cnns-for-image-classification) 33 | - Data Preparation: Image Transforms and Augmentation specific to CNNs 34 | - Loss Function (e.g., `nn.CrossEntropyLoss`) 35 | - Optimizer (e.g., Adam, SGD) 36 | - The Training Loop (Revisiting with CNN context) 37 | 5. [Understanding and Implementing Famous CNN Architectures (Conceptual Overview)](#understanding-and-implementing-famous-cnn-architectures-conceptual-overview) 38 | - **LeNet-5:** A pioneering CNN. 39 | - **AlexNet:** Deepened the architecture, used ReLUs and Dropout. 40 | - **VGGNets:** Simplicity with deeper stacks of small (3x3) convolutions. 41 | - **GoogLeNet (Inception):** Introduced Inception modules for efficiency and multi-scale processing. 
42 | - **ResNet (Residual Networks):** Introduced residual connections to train very deep networks. 43 | - (Implementation of one simple architecture like LeNet-5 will be in the .py script) 44 | 6. [Transfer Learning with Pre-trained CNN Models](#transfer-learning-with-pre-trained-cnn-models) 45 | - What is Transfer Learning? 46 | - Benefits: Reduced training time, better performance with less data. 47 | - Using Pre-trained Models from `torchvision.models` (e.g., ResNet, VGG). 48 | - **Feature Extraction:** Using the pre-trained CNN as a fixed feature extractor by freezing its weights and replacing the classifier head. 49 | - **Fine-tuning:** Unfreezing some of the later layers of the pre-trained model and training them with a smaller learning rate on the new dataset. 50 | 7. [Visualizing What CNNs Learn (Feature Visualization - Conceptual)](#visualizing-what-cnns-learn-feature-visualization---conceptual) 51 | - Understanding intermediate feature maps. 52 | - Visualizing Convolutional Filters (first layer). 53 | - Techniques like Saliency Maps, Class Activation Maps (CAM), Grad-CAM (Conceptual Overview). 54 | 8. [Practical Tips for Training CNNs](#practical-tips-for-training-cnns) 55 | - Data Augmentation is Key 56 | - Appropriate Learning Rates and Schedulers 57 | - Choosing Batch Size (considering GPU memory) 58 | - Regularization (Dropout, Weight Decay) 59 | - Monitoring Validation Performance 60 | 61 | ## Introduction to Convolutional Neural Networks 62 | 63 | - **What are CNNs and Why Use Them for Images?** 64 | CNNs are specialized neural networks designed to process data with a grid-like topology, such as images (2D grid of pixels) or audio (1D grid of time samples). They are highly effective for image-related tasks because they can automatically and adaptively learn spatial hierarchies of features from low-level edges and textures to high-level object parts and concepts. 65 | - **Key Concepts:** 66 | - **Local Receptive Fields:** Each neuron in a convolutional layer is connected to only a small region of the input volume (its local receptive field), allowing it to learn local features. 67 | - **Shared Weights (Parameter Sharing):** The same set of weights (kernel/filter) is used across different spatial locations in the input. This drastically reduces the number of parameters and makes the model equivariant to translations of features. 68 | - **Pooling:** Summarizes features in a neighborhood, providing a degree of translation invariance and reducing dimensionality. 69 | 70 | ## Core CNN Layers and Components 71 | 72 | - **Convolutional Layers (`nn.Conv2d`)** 73 | The core building block of a CNN. It performs a convolution operation, sliding a learnable filter (kernel) over the input. 74 | - **Kernels (Filters):** Small matrices of learnable parameters. Each kernel is responsible for detecting a specific feature (e.g., an edge, a texture). The depth of the kernel matches the depth (number of channels) of its input. 75 | - **Input and Output Channels:** `in_channels` is the number of channels in the input volume (e.g., 3 for RGB images). `out_channels` is the number of filters applied, determining the depth of the output feature map. 76 | - **Feature Maps:** The output of a convolutional layer. Each channel in the output feature map corresponds to the response of a specific filter across the input. 77 | - **Parameters:** 78 | - `kernel_size (int or tuple)`: Size of the filter (e.g., 3 for 3x3, (3,5) for 3x5). 
79 | - `stride (int or tuple, optional)`: Step size with which the filter slides over the input (default: 1). 80 | - `padding (int or tuple, optional)`: Amount of zero-padding added to the borders of the input (default: 0). Padding can help control the spatial size of the output feature map and preserve border information. 81 | - `dilation (int or tuple, optional)`: Spacing between kernel elements (default: 1). 82 | ```python 83 | import torch 84 | import torch.nn as nn 85 | 86 | # Example: Conv2d layer 87 | # Input: Batch of 16 images, 3 channels (RGB), 32x32 pixels 88 | # Output: 32 feature maps (output channels), spatial size depends on kernel, stride, padding 89 | conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1) 90 | # input_tensor = torch.randn(16, 3, 32, 32) # Batch, Channels, Height, Width 91 | # output_feature_map = conv1(input_tensor) 92 | # print(f"Output feature map shape: {output_feature_map.shape}") # e.g., [16, 32, 32, 32] 93 | ``` 94 | 95 | - **Activation Functions (ReLU)** 96 | Typically, a non-linear activation function like ReLU (`nn.ReLU()`) is applied element-wise after each convolutional operation to introduce non-linearity. 97 | 98 | - **Pooling Layers (`nn.MaxPool2d`, `nn.AvgPool2d`)** 99 | Reduce the spatial dimensions (height and width) of the feature maps, reducing computation and parameters, and providing a form of translation invariance. 100 | - `nn.MaxPool2d(kernel_size, stride=None)`: Selects the maximum value from each patch of the feature map covered by the pooling window. 101 | - `nn.AvgPool2d(kernel_size, stride=None)`: Computes the average value. 102 | ```python 103 | # pool = nn.MaxPool2d(kernel_size=2, stride=2) # Reduces H and W by factor of 2 104 | # pooled_output = pool(output_feature_map) # Assuming output_feature_map from conv1 105 | # print(f"Pooled output shape: {pooled_output.shape}") # e.g., [16, 32, 16, 16] 106 | ``` 107 | 108 | - **Fully Connected Layers (`nn.Linear`)** 109 | After several convolutional and pooling layers, the high-level features are typically flattened and fed into one or more fully connected layers for classification or regression. 110 | - **Flattening:** Converting the 3D feature maps (Channels x Height x Width) into a 1D vector. 111 | 112 | - **Batch Normalization (`nn.BatchNorm2d`)** 113 | Applied after convolutional layers (and before or after activation) to normalize the activations across the batch. Helps stabilize training, allows higher learning rates, and can act as a regularizer. 114 | 115 | - **Dropout (`nn.Dropout2d`, `nn.Dropout`)** 116 | `nn.Dropout2d` randomly zeros out entire channels during training. `nn.Dropout` (1D dropout) is used for fully connected layers. Helps prevent overfitting. 
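Flattening, batch normalization, and dropout are easiest to see together in a short snippet. Below is a minimal sketch (shapes and dropout rate chosen purely for illustration, not taken from the tutorial script) of a conv block that combines `nn.BatchNorm2d` and `nn.Dropout2d`, followed by flattening into a fully connected layer:

```python
import torch
import torch.nn as nn

# A small conv block: Conv -> BatchNorm -> ReLU -> Dropout2d, then flatten for a Linear layer
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),      # normalizes each of the 16 channels across the batch
    nn.ReLU(),
    nn.Dropout2d(p=0.25),    # randomly zeroes whole channels during training
)

x = torch.randn(8, 3, 32, 32)                      # batch of 8 RGB 32x32 images
features = block(x)                                # shape: [8, 16, 32, 32]
flattened = features.view(features.size(0), -1)    # shape: [8, 16*32*32]
fc = nn.Linear(16 * 32 * 32, 10)
logits = fc(flattened)                             # shape: [8, 10]
print(features.shape, flattened.shape, logits.shape)
```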
117 | 118 | ## Building a Basic CNN Architecture 119 | 120 | A typical CNN architecture pattern: 121 | `INPUT -> [[CONV -> ACT -> POOL] * N -> FLATTEN -> [FC -> ACT] * M -> FC (Output)]` 122 | 123 | ```python 124 | class SimpleCNN(nn.Module): 125 | def __init__(self, num_classes=10): 126 | super(SimpleCNN, self).__init__() 127 | self.conv_block1 = nn.Sequential( 128 | nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2), # MNIST: 1 channel 129 | nn.ReLU(), 130 | nn.MaxPool2d(kernel_size=2, stride=2) # Output: 16 x 14 x 14 (for 28x28 input) 131 | ) 132 | self.conv_block2 = nn.Sequential( 133 | nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2), 134 | nn.ReLU(), 135 | nn.MaxPool2d(kernel_size=2, stride=2) # Output: 32 x 7 x 7 136 | ) 137 | # After two max pooling layers of stride 2, a 28x28 image becomes 7x7. 138 | # So, the flattened size is 32 (channels) * 7 * 7. 139 | self.fc = nn.Linear(32 * 7 * 7, num_classes) 140 | 141 | def forward(self, x): # Input x shape: [batch_size, 1, 28, 28] for MNIST 142 | x = self.conv_block1(x) 143 | x = self.conv_block2(x) 144 | x = x.view(x.size(0), -1) # Flatten the feature maps: [batch_size, 32*7*7] 145 | x = self.fc(x) 146 | return x # Raw logits for classification 147 | 148 | # model_cnn = SimpleCNN(num_classes=10) # For MNIST (10 digits) 149 | # print(model_cnn) 150 | ``` 151 | 152 | ## Training CNNs for Image Classification 153 | 154 | Training involves the same general steps as other neural networks, but with data and augmentations tailored for images. 155 | - **Data Preparation:** Use `torchvision.transforms` for normalization, resizing, and data augmentation (random flips, rotations, crops, color jitter, etc.). 156 | - **Loss Function:** `nn.CrossEntropyLoss` is standard for multi-class image classification. 157 | - **Optimizer:** Adam or SGD with momentum are common choices. 158 | 159 | ## Understanding and Implementing Famous CNN Architectures (Conceptual Overview) 160 | 161 | - **LeNet-5:** One of the earliest successful CNNs, designed for digit recognition. 162 | - **AlexNet:** Won the ImageNet LSVRC-2012. Deeper than LeNet, used ReLU, Dropout, and data augmentation extensively. 163 | - **VGGNets:** Showed that depth is critical. Used very small (3x3) convolutional filters stacked deeply. 164 | - **GoogLeNet (Inception):** Introduced the "Inception module," which performs convolutions at multiple scales in parallel and concatenates their outputs, improving performance and computational efficiency. 165 | - **ResNet (Residual Networks):** Enabled training of extremely deep networks (hundreds of layers) by introducing "residual connections" (skip connections) that allow gradients to propagate more easily. 166 | 167 | ## Transfer Learning with Pre-trained CNN Models 168 | 169 | Leveraging models pre-trained on large datasets (like ImageNet) can significantly boost performance on smaller, related datasets. 170 | 171 | - **`torchvision.models`:** Provides access to many pre-trained models (ResNet, VGG, Inception, MobileNet, etc.). 172 | ```python 173 | import torchvision.models as models 174 | # resnet18_pretrained = models.resnet18(pretrained=True) # PyTorch < 0.13 175 | # resnet18_pretrained = models.resnet18(weights=models.ResNet18_Weights.DEFAULT) # PyTorch >= 0.13 176 | ``` 177 | - **Feature Extraction:** Freeze the weights of the convolutional base of the pre-trained model and replace its final classification layer with a new one suited to your task. Train only the new classifier. 
178 | - **Fine-tuning:** Unfreeze some of the top layers of the pre-trained model in addition to training the new classifier. Use a small learning rate to avoid catastrophically forgetting the learned features. 179 | 180 | ## Visualizing What CNNs Learn (Feature Visualization - Conceptual) 181 | 182 | Understanding the internal workings of CNNs can be aided by visualizing: 183 | - **Filters:** Especially in the first layer, filters often learn to detect simple patterns like edges, corners, and color blobs. 184 | - **Feature Maps (Activations):** Show which regions of an image activate certain filters/channels at different layers, revealing the hierarchical feature extraction process. 185 | - **Saliency Maps/Class Activation Maps (CAM/Grad-CAM):** Highlight the image regions most influential in a model's prediction for a specific class. 186 | 187 | ## Practical Tips for Training CNNs 188 | - Start with a standard architecture (e.g., ResNet variant) and pre-trained weights if applicable. 189 | - Aggressive data augmentation is often very beneficial. 190 | - Use appropriate learning rates, often starting higher and decaying (e.g., with a scheduler). 191 | - Batch Normalization is generally helpful. 192 | - Monitor training and validation metrics closely. 193 | 194 | ## Running the Tutorial 195 | 196 | To run the Python script associated with this tutorial: 197 | ```bash 198 | python convolutional_neural_networks.py 199 | ``` 200 | We recommend you manually create a `convolutional_neural_networks.ipynb` notebook and copy the code from the Python script into it for an interactive experience. 201 | 202 | ## Prerequisites 203 | - Python 3.7+ 204 | - PyTorch 1.10+ 205 | - Torchvision 206 | - NumPy 207 | - Matplotlib (for visualization) 208 | 209 | ## Related Tutorials 210 | 1. [Data Loading and Preprocessing](../05_data_loading_preprocessing/README.md) 211 | 2. [Training Neural Networks](../04_training_neural_networks/README.md) 212 | 3. [Recurrent Neural Networks](../07_recurrent_neural_networks/README.md) (for sequence data) -------------------------------------------------------------------------------- /02_neural_networks_fundamentals/README.md: -------------------------------------------------------------------------------- 1 | # Neural Networks Fundamentals in PyTorch 2 | 3 | This tutorial provides a comprehensive introduction to the fundamental concepts of neural networks and their implementation using PyTorch. We will cover the building blocks of neural networks, how they learn, and how to construct your first neural network. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Neural Networks](#introduction-to-neural-networks) 7 | - What is a Neural Network? 8 | - Biological Inspiration 9 | - Basic Components: Neurons, Weights, Biases, Layers 10 | - Types of Neural Networks (Brief Overview) 11 | 2. [The Perceptron: The Simplest Neural Network](#the-perceptron-the-simplest-neural-network) 12 | - Single-Layer Perceptron 13 | - Linear Separability 14 | 3. [Activation Functions](#activation-functions) 15 | - Purpose: Introducing Non-linearity 16 | - Common Activation Functions: 17 | - Sigmoid 18 | - Tanh (Hyperbolic Tangent) 19 | - ReLU (Rectified Linear Unit) and its variants (Leaky ReLU, ELU) 20 | - Softmax (for output layers in classification) 21 | - Choosing an Activation Function 22 | - PyTorch Implementation 23 | 4. 
[Multi-Layer Perceptrons (MLPs)](#multi-layer-perceptrons-mlps) 24 | - Architecture: Input, Hidden, and Output Layers 25 | - The Power of Hidden Layers: Universal Approximation Theorem (Concept) 26 | - Forward Propagation in an MLP 27 | 5. [Defining a Neural Network in PyTorch (`nn.Module`)](#defining-a-neural-network-in-pytorch-nnmodule) 28 | - The `nn.Module` Class 29 | - Defining Layers (`nn.Linear`, etc.) 30 | - Implementing the `forward` method 31 | - Example: A Simple MLP for Classification 32 | 6. [Loss Functions: Measuring Model Error](#loss-functions-measuring-model-error) 33 | - Purpose of Loss Functions 34 | - Common Loss Functions: 35 | - Mean Squared Error (MSE) (`nn.MSELoss`): For Regression 36 | - Cross-Entropy Loss (`nn.CrossEntropyLoss`): For Multi-class Classification 37 | - Binary Cross-Entropy Loss (`nn.BCELoss`, `nn.BCEWithLogitsLoss`): For Binary Classification 38 | - Choosing the Right Loss Function 39 | 7. [Optimizers: How Neural Networks Learn](#optimizers-how-neural-networks-learn) 40 | - Gradient Descent (Concept) 41 | - Stochastic Gradient Descent (SGD) 42 | - SGD with Momentum 43 | - Adam Optimizer (`torch.optim.Adam`) 44 | - Learning Rate 45 | - Linking Optimizers to Model Parameters 46 | 8. [The Training Loop: Forward and Backward Propagation](#the-training-loop-forward-and-backward-propagation) 47 | - Overview of the Training Process 48 | - **Forward Propagation:** Calculating Predictions and Loss 49 | - **Backward Propagation (Backpropagation):** Calculating Gradients (`loss.backward()`) 50 | - **Optimizer Step:** Updating Weights (`optimizer.step()`) 51 | - Zeroing Gradients (`optimizer.zero_grad()`) 52 | - Iterating over Data (Epochs and Batches) 53 | 9. [Building and Training Your First Neural Network in PyTorch](#building-and-training-your-first-neural-network-in-pytorch) 54 | - Step 1: Prepare the Data (e.g., a simple synthetic dataset) 55 | - Step 2: Define the Model (using `nn.Module`) 56 | - Step 3: Define Loss Function and Optimizer 57 | - Step 4: Implement the Training Loop 58 | - Step 5: Evaluate the Model (Conceptual) 59 | 60 | ## Introduction to Neural Networks 61 | 62 | - **What is a Neural Network?** 63 | An Artificial Neural Network (ANN) is a computational model inspired by the structure and function of biological neural networks in the human brain. It consists of interconnected processing units called neurons (or nodes) organized in layers. 64 | - **Biological Inspiration:** Neurons in the brain receive signals, process them, and transmit signals to other neurons. ANNs attempt to mimic this behavior mathematically. 65 | - **Basic Components:** 66 | - **Neurons (Nodes):** Basic computational units that receive inputs, perform a calculation (typically a weighted sum followed by an activation function), and produce an output. 67 | - **Weights:** Parameters associated with each input to a neuron, representing the strength or importance of that input. 68 | - **Biases:** Additional parameters added to the weighted sum, allowing the neuron to be activated even when all inputs are zero, or shifting the activation function. 69 | - **Layers:** Neurons are organized into layers: an input layer, one or more hidden layers, and an output layer. 70 | - **Types of Neural Networks:** Feedforward Neural Networks (FNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Transformers, etc. This tutorial focuses on FNNs (specifically MLPs). 
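To make these components concrete, here is a minimal sketch (with arbitrarily chosen values) of a single artificial neuron computed by hand: a weighted sum of the inputs plus a bias, passed through a sigmoid activation.

```python
import torch

# A single neuron: weighted sum of inputs, plus bias, then activation
inputs = torch.tensor([0.5, -1.0, 2.0])           # 3 input features
weights = torch.tensor([0.1, 0.4, -0.2])          # one weight per input
bias = torch.tensor(0.3)

weighted_sum = torch.dot(weights, inputs) + bias  # z = w.x + b
output = torch.sigmoid(weighted_sum)              # activation squashes z into (0, 1)
print(f"Weighted sum: {weighted_sum.item():.3f}, Neuron output: {output.item():.3f}")
```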
71 | 72 | ## The Perceptron: The Simplest Neural Network 73 | 74 | - **Single-Layer Perceptron:** The simplest form of a neural network, consisting of a single layer of output neurons. Inputs are fed directly to the outputs via a series of weights. It performs a weighted sum of inputs and applies an activation function (often a step function). 75 | `output = activation(sum(weights_i * input_i) + bias)` 76 | - **Linear Separability:** A single-layer perceptron can only solve linearly separable problems. 77 | 78 | ## Activation Functions 79 | 80 | - **Purpose:** Activation functions introduce non-linearity into the network. Without non-linearity, a multi-layer network would behave like a single-layer linear network, severely limiting its ability to model complex relationships. 81 | - **Common Activation Functions:** 82 | - **Sigmoid:** `f(x) = 1 / (1 + exp(-x))`. Squashes values between 0 and 1. Used in older networks, can suffer from vanishing gradients. 83 | - **Tanh (Hyperbolic Tangent):** `f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))`. Squashes values between -1 and 1. Also prone to vanishing gradients but often preferred over sigmoid in hidden layers as it's zero-centered. 84 | - **ReLU (Rectified Linear Unit):** `f(x) = max(0, x)`. Computationally efficient, helps alleviate vanishing gradients. Most popular choice for hidden layers. 85 | - **Leaky ReLU:** `f(x) = max(0.01*x, x)`. Addresses the "dying ReLU" problem by allowing a small, non-zero gradient when the unit is not active. 86 | - **Softmax:** `f(x_i) = exp(x_i) / sum(exp(x_j))`. Used in the output layer of multi-class classification tasks to convert raw scores (logits) into probabilities that sum to 1. 87 | 88 | ```python 89 | import torch 90 | import torch.nn as nn 91 | import torch.nn.functional as F 92 | 93 | # Examples of activation functions 94 | sigmoid = nn.Sigmoid() 95 | relu = nn.ReLU() 96 | tanh = nn.Tanh() 97 | softmax = nn.Softmax(dim=1) # Apply softmax across a specific dimension 98 | 99 | input_tensor = torch.randn(2, 3) # Batch of 2, 3 features each 100 | print("Input:\n", input_tensor) 101 | print("Sigmoid output:\n", sigmoid(input_tensor)) 102 | print("ReLU output:\n", relu(input_tensor)) 103 | print("Tanh output:\n", tanh(input_tensor)) 104 | # For softmax, let's assume these are logits for 2 samples, 3 classes 105 | print("Softmax output:\n", softmax(input_tensor)) 106 | ``` 107 | 108 | ## Multi-Layer Perceptrons (MLPs) 109 | 110 | MLPs are feedforward neural networks with one or more hidden layers between the input and output layers. Each layer is fully connected to the next. 111 | 112 | - **Architecture:** 113 | - **Input Layer:** Receives the raw input data. 114 | - **Hidden Layer(s):** Perform intermediate computations. The number of hidden layers and neurons per layer are hyperparameters. 115 | - **Output Layer:** Produces the final prediction. 116 | - **Universal Approximation Theorem:** (Conceptual) An MLP with at least one hidden layer and a non-linear activation function can approximate any continuous function to an arbitrary degree of accuracy, given enough neurons. 117 | - **Forward Propagation:** The process of passing input data through the network layer by layer to compute the output. 118 | `h1 = activation1(W1*x + b1)` 119 | `h2 = activation2(W2*h1 + b2)` 120 | `output = activation_out(W_out*h2 + b_out)` 121 | 122 | ## Defining a Neural Network in PyTorch (`nn.Module`) 123 | 124 | PyTorch provides the `nn.Module` class as a base for all neural network modules. 
125 | 126 | - **The `nn.Module` Class:** 127 | - Your custom network should inherit from `nn.Module`. 128 | - Layers are defined as attributes in the `__init__` method. 129 | - The `forward` method defines how input data flows through the network. 130 | - **Defining Layers:** PyTorch offers various predefined layers in `torch.nn`: 131 | - `nn.Linear(in_features, out_features)`: Applies a linear transformation (fully connected layer). 132 | - `nn.Conv2d`, `nn.RNN`, etc., for other network types. 133 | 134 | ```python 135 | class SimpleMLP(nn.Module): 136 | def __init__(self, input_size, hidden_size, num_classes): 137 | super(SimpleMLP, self).__init__() 138 | self.fc1 = nn.Linear(input_size, hidden_size) # Input layer to hidden layer 139 | self.relu = nn.ReLU() # Activation function 140 | self.fc2 = nn.Linear(hidden_size, num_classes) # Hidden layer to output layer 141 | 142 | def forward(self, x): 143 | # x is the input tensor 144 | out = self.fc1(x) 145 | out = self.relu(out) 146 | out = self.fc2(out) 147 | # No softmax here if using nn.CrossEntropyLoss, as it combines Softmax and NLLLoss 148 | return out 149 | 150 | # Example usage 151 | input_dim = 784 # e.g., for flattened 28x28 MNIST images 152 | hidden_dim = 128 153 | output_dim = 10 # e.g., for 10 digit classes 154 | model_mlp = SimpleMLP(input_dim, hidden_dim, output_dim) 155 | print(model_mlp) 156 | ``` 157 | 158 | ## Loss Functions: Measuring Model Error 159 | 160 | Loss functions (or cost functions) quantify how far the model's predictions are from the actual target values. 161 | 162 | - **Common Loss Functions:** 163 | - **`nn.MSELoss` (Mean Squared Error):** For regression tasks. `loss = (1/N) * sum((y_true - y_pred)^2)`. 164 | - **`nn.CrossEntropyLoss`:** For multi-class classification. It conveniently combines `nn.LogSoftmax` and `nn.NLLLoss`. Expects raw logits as model output. 165 | - **`nn.BCELoss` (Binary Cross-Entropy Loss):** For binary classification. Expects model output to be probabilities (after a Sigmoid activation). 166 | - **`nn.BCEWithLogitsLoss`:** For binary classification. More numerically stable than `nn.BCELoss` as it combines Sigmoid and BCE. Expects raw logits. 167 | 168 | ```python 169 | # Example Loss Functions 170 | loss_mse = nn.MSELoss() 171 | loss_ce = nn.CrossEntropyLoss() 172 | loss_bce_logits = nn.BCEWithLogitsLoss() 173 | 174 | # For MSE (Regression) 175 | predictions_reg = torch.randn(5, 1) # 5 samples, 1 output value 176 | targets_reg = torch.randn(5, 1) 177 | mse = loss_mse(predictions_reg, targets_reg) 178 | print(f"MSE Loss: {mse.item()}") 179 | 180 | # For CrossEntropy (Multi-class classification) 181 | predictions_mc = torch.randn(5, 3) # 5 samples, 3 classes (logits) 182 | targets_mc = torch.tensor([0, 1, 2, 0, 1]) # True class indices 183 | ce = loss_ce(predictions_mc, targets_mc) 184 | print(f"CrossEntropy Loss: {ce.item()}") 185 | 186 | # For BCEWithLogits (Binary classification) 187 | predictions_bc = torch.randn(5, 1) # 5 samples, 1 output logit 188 | targets_bc = torch.rand(5, 1) # True probabilities (or 0s and 1s) 189 | bce_wl = loss_bce_logits(predictions_bc, targets_bc) 190 | print(f"BCEWithLogits Loss: {bce_wl.item()}") 191 | ``` 192 | 193 | ## Optimizers: How Neural Networks Learn 194 | 195 | Optimizers implement algorithms to update the model's weights based on the gradients computed during backpropagation, aiming to minimize the loss function. 196 | 197 | - **Gradient Descent:** Iteratively moves in the direction opposite to the gradient of the loss function. 
198 | - **Stochastic Gradient Descent (SGD):** Uses a single training example or a small batch to compute the gradient and update weights, making it faster and often able to escape local minima. 199 | `optimizer = torch.optim.SGD(model.parameters(), lr=0.01)` 200 | - **SGD with Momentum:** Adds a fraction of the previous update vector to the current one, helping accelerate SGD in the relevant direction and dampening oscillations. 201 | `optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)` 202 | - **Adam (Adaptive Moment Estimation):** An adaptive learning rate optimization algorithm that computes individual learning rates for different parameters. Often a good default choice. 203 | `optimizer = torch.optim.Adam(model.parameters(), lr=0.001)` 204 | - **Learning Rate (lr):** A crucial hyperparameter that controls the step size during weight updates. 205 | 206 | ## The Training Loop: Forward and Backward Propagation 207 | 208 | The core process of training a neural network involves repeatedly feeding data to the model and adjusting its weights. 209 | 210 | - **Forward Propagation:** Input data is passed through the network to generate predictions. The loss function then compares these predictions to the true targets to compute the loss. 211 | `outputs = model(inputs)` 212 | `loss = criterion(outputs, labels)` 213 | - **Backward Propagation (Backpropagation):** The `loss.backward()` call computes the gradients of the loss with respect to all model parameters (weights and biases) that have `requires_grad=True`. 214 | - **Optimizer Step:** The `optimizer.step()` call updates the model parameters using the computed gradients and the optimizer's update rule (e.g., SGD, Adam). 215 | - **Zeroing Gradients:** Before each `loss.backward()` call in a new iteration, it's crucial to clear old gradients using `optimizer.zero_grad()`. Otherwise, gradients would accumulate across iterations. 216 | - **Epochs and Batches:** 217 | - **Epoch:** One complete pass through the entire training dataset. 218 | - **Batch:** A subset of the training dataset processed in one iteration of the training loop. 219 | 220 | ## Building and Training Your First Neural Network in PyTorch 221 | 222 | This section will be detailed in the accompanying Python script (`neural_networks_fundamentals.py`) and Jupyter Notebook, showing a complete end-to-end example. 223 | 224 | **Conceptual Steps:** 225 | 1. **Prepare Data:** Load and preprocess your dataset. PyTorch uses `Dataset` and `DataLoader` classes. 226 | 2. **Define Model:** Create your neural network class inheriting from `nn.Module`. 227 | 3. **Define Loss and Optimizer:** Instantiate your chosen loss function and optimizer, linking the optimizer to your model's parameters. 228 | 4. **Training Loop:** 229 | ```python 230 | # num_epochs = ... 231 | # for epoch in range(num_epochs): 232 | # for i, (inputs, labels) in enumerate(train_loader): 233 | # # Move tensors to the configured device (CPU/GPU) 234 | # inputs = inputs.to(device) 235 | # labels = labels.to(device) 236 | # 237 | # # Forward pass 238 | # outputs = model(inputs) 239 | # loss = criterion(outputs, labels) 240 | # 241 | # # Backward and optimize 242 | # optimizer.zero_grad() 243 | # loss.backward() 244 | # optimizer.step() 245 | # 246 | # if (i+1) % 100 == 0: 247 | # print (f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}') 248 | ``` 249 | 5. **Evaluate Model:** Assess performance on a separate test dataset. 
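For step 5, a minimal evaluation sketch could look like the following, assuming the `model` and `device` from the training loop above and a `test_loader` built analogously to `train_loader`. Gradient tracking is disabled and classification accuracy is computed:

```python
model.eval()                      # switch to evaluation mode (affects Dropout/BatchNorm)
correct, total = 0, 0
with torch.no_grad():             # no gradients needed for evaluation
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        predicted = outputs.argmax(dim=1)          # class with the highest logit
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f"Test accuracy: {100 * correct / total:.2f}%")
```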
250 | 251 | ## Running the Tutorial 252 | 253 | To run the Python script associated with this tutorial: 254 | ```bash 255 | python neural_networks_fundamentals.py 256 | ``` 257 | Alternatively, you can follow along with the Jupyter notebook `neural_networks_fundamentals.ipynb` for an interactive experience. We recommend manually creating the notebook and copying code from the script if direct creation fails. 258 | 259 | ## Prerequisites 260 | - Python 3.7+ 261 | - PyTorch 1.10+ 262 | - NumPy 263 | - Matplotlib (for visualization) 264 | - Scikit-learn (for generating sample data or splitting) 265 | 266 | ## Related Tutorials 267 | 1. [PyTorch Basics](../01_pytorch_basics/README.md) 268 | 2. [Automatic Differentiation](../03_automatic_differentiation/README.md) 269 | 3. [Training Neural Networks](../04_training_neural_networks/README.md) -------------------------------------------------------------------------------- /03_automatic_differentiation/README.md: -------------------------------------------------------------------------------- 1 | # Automatic Differentiation with PyTorch Autograd 2 | 3 | This tutorial provides a detailed explanation of PyTorch's automatic differentiation system, known as Autograd. Understanding Autograd is crucial for training neural networks as it automates the computation of gradients. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Automatic Differentiation](#introduction-to-automatic-differentiation) 7 | - What is Differentiation? 8 | - Manual vs. Symbolic vs. Automatic Differentiation 9 | - Why Automatic Differentiation for Deep Learning? 10 | 2. [PyTorch Autograd: The Basics](#pytorch-autograd-the-basics) 11 | - Tensors and `requires_grad` 12 | - The `grad_fn` (Gradient Function) 13 | - Computing Gradients: `backward()` 14 | - Accessing Gradients: `.grad` attribute 15 | 3. [The Computational Graph](#the-computational-graph) 16 | - Dynamic Computational Graphs in PyTorch 17 | - How Autograd Constructs the Graph 18 | - Nodes and Edges: Tensors and Operations 19 | - Leaf Nodes vs. Non-Leaf Nodes 20 | 4. [Gradient Accumulation](#gradient-accumulation) 21 | - How Gradients Accumulate by Default 22 | - Zeroing Gradients: `optimizer.zero_grad()` or `tensor.grad.zero_()` 23 | - Use Cases for Gradient Accumulation (e.g., simulating larger batch sizes) 24 | 5. [Excluding Tensors from Autograd (`torch.no_grad()`, `detach()`)](#excluding-tensors-from-autograd-torchnograd-detach) 25 | - `torch.no_grad()`: Context manager to disable gradient computation. 26 | - `.detach()`: Creates a new tensor that shares the same data but is detached from the computation history. 27 | - Use cases: Inference, freezing layers, modifying tensors without tracking. 28 | 6. [Gradients of Non-Scalar Outputs (Vector-Jacobian Product)](#gradients-of-non-scalar-outputs-vector-jacobian-product) 29 | - `backward()` on a non-scalar tensor requires a `gradient` argument. 30 | - Understanding the Vector-Jacobian Product (JVP) concept. 31 | - Practical examples. 32 | 7. [Higher-Order Derivatives](#higher-order-derivatives) 33 | - Computing gradients of gradients. 34 | - Using `torch.autograd.grad()` for more control. 35 | - `create_graph=True` in `backward()` or `torch.autograd.grad()`. 36 | 8. [In-place Operations and Autograd](#in-place-operations-and-autograd) 37 | - Potential issues with in-place operations (ending with `_`). 38 | - Autograd's need for original values for gradient computation. 39 | - When they might be problematic and when they are safe. 40 | 9. 
[Custom Autograd Functions (`torch.autograd.Function`)](#custom-autograd-functions-torchautogradfunction) 41 | - When to use: Implementing novel operations, non-PyTorch computations. 42 | - Subclassing `torch.autograd.Function`. 43 | - Defining `forward()` and `backward()` static methods. 44 | - `ctx` (context) object for saving tensors for backward pass. 45 | - Example: A custom ReLU or a simple custom operation. 46 | 10. [Practical Considerations and Tips](#practical-considerations-and-tips) 47 | - Checking if a tensor requires gradients: `tensor.requires_grad`. 48 | - Checking if a tensor is a leaf tensor: `tensor.is_leaf`. 49 | - Memory usage: Autograd stores intermediate values for backward pass. 50 | - `retain_graph=True` in `backward()`: When needed and its implications. 51 | 52 | ## Introduction to Automatic Differentiation 53 | 54 | - **What is Differentiation?** Finding the rate of change of a function with respect to its input variables (i.e., its derivatives or gradients). 55 | - **Manual Differentiation:** Deriving gradients by hand. Tedious and error-prone for complex functions like neural networks. 56 | - **Symbolic Differentiation:** Using computer algebra systems to manipulate mathematical expressions and find derivatives (e.g., Wolfram Alpha, SymPy). Can lead to complex and inefficient expressions. 57 | - **Automatic Differentiation (AD):** A set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD decomposes the computation into a sequence of elementary operations (addition, multiplication, sin, exp, etc.) and applies the chain rule repeatedly. 58 | - **Reverse Mode AD:** What PyTorch uses. Computes gradients by traversing the computational graph backward from output to input. Efficient for functions with many inputs and few outputs (like neural network loss functions). 59 | - **Why AD for Deep Learning?** Neural networks are complex functions with millions of parameters. AD (specifically reverse mode) provides an efficient and accurate way to compute the gradients of the loss function with respect to all these parameters, which is essential for gradient-based optimization (like SGD). 60 | 61 | ## PyTorch Autograd: The Basics 62 | 63 | PyTorch's `autograd` package provides automatic differentiation for all operations on Tensors. 64 | 65 | - **Tensors and `requires_grad`:** 66 | - If a `Tensor` has its `requires_grad` attribute set to `True`, PyTorch tracks all operations on it. This is typically done for learnable parameters (weights, biases) or tensors that are part of a computation leading to a value for which gradients are needed. 67 | - You can set `requires_grad=True` when creating a tensor or later using `tensor.requires_grad_(True)` (in-place). 68 | - **The `grad_fn` (Gradient Function):** 69 | - When an operation is performed on tensors that require gradients, the resulting tensor will have a `grad_fn` attribute. This function knows how to compute the gradient of that operation during the backward pass. 70 | - Leaf tensors (created by the user, not as a result of an operation) with `requires_grad=True` will have `grad_fn=None` initially, but their `.grad` attribute will be populated after `backward()`. 71 | - **Computing Gradients: `backward()`:** 72 | - To compute gradients, you call `.backward()` on a scalar tensor (e.g., the loss). If the tensor is non-scalar, you need to provide a `gradient` argument (see Section 6). 
73 | - This initiates the backward pass, computing gradients for all tensors in the computational graph that have `requires_grad=True`. 74 | - **Accessing Gradients: `.grad` attribute:** 75 | - After `loss.backward()` is called, the gradients are accumulated in the `.grad` attribute of the leaf tensors (those for which `requires_grad=True`). 76 | 77 | ```python 78 | import torch 79 | 80 | # Example 1: Basic gradient computation 81 | x = torch.tensor(2.0, requires_grad=True) 82 | y = torch.tensor(3.0, requires_grad=True) 83 | z = x**2 + y**3 # z = 2^2 + 3^3 = 4 + 27 = 31 84 | 85 | # Compute gradients 86 | z.backward() # Computes dz/dx and dz/dy 87 | 88 | print(f"x: {x}, Gradient dz/dx: {x.grad}") # dz/dx = 2*x = 2*2 = 4 89 | print(f"y: {y}, Gradient dz/dy: {y.grad}") # dz/dy = 3*y^2 = 3*3^2 = 27 90 | 91 | # grad_fn example 92 | print(f"z.grad_fn: {z.grad_fn}") # Should show 93 | print(f"x.grad_fn: {x.grad_fn}") # Leaf tensor, no grad_fn from previous op 94 | ``` 95 | 96 | ## The Computational Graph 97 | 98 | - **Dynamic Computational Graphs:** PyTorch builds the computational graph on-the-fly as operations are executed (define-by-run). This allows for more flexibility in model architecture (e.g., using standard Python control flow like loops and conditionals). 99 | - **How Autograd Constructs the Graph:** Each operation on tensors with `requires_grad=True` creates a new node in the graph. Tensors are nodes, and operations (`grad_fn`) are edges that define how to compute gradients. 100 | - **Leaf Nodes:** Tensors created directly by the user (e.g., `torch.tensor(...)`, model parameters). Their gradients are accumulated in `.grad`. 101 | - **Non-Leaf Nodes (Intermediate Tensors):** Tensors resulting from operations. They have a `grad_fn`. By default, their gradients are not saved to save memory, but can be retained using `tensor.retain_grad()`. 102 | 103 | ## Gradient Accumulation 104 | 105 | - **How Gradients Accumulate:** When `backward()` is called multiple times (e.g., in a loop without zeroing gradients), gradients are summed (accumulated) in the `.grad` attribute of leaf tensors. 106 | - **Zeroing Gradients:** It's crucial to zero out gradients before each new backward pass in a typical training loop using `optimizer.zero_grad()` or by manually setting `tensor.grad.zero_()` for each parameter. Otherwise, gradients from previous batches/iterations will interfere. 107 | - **Use Cases for Accumulation:** Deliberate gradient accumulation can be used to simulate a larger effective batch size when GPU memory is limited. You perform several forward/backward passes accumulating gradients and then perform an optimizer step. 108 | 109 | ```python 110 | x = torch.tensor(1.0, requires_grad=True) 111 | y1 = x * 2 112 | y2 = x * 3 113 | 114 | # First backward pass 115 | y1.backward(retain_graph=True) # retain_graph needed if y2.backward() follows on same graph portion 116 | print(f"After y1.backward(), x.grad: {x.grad}") # dy1/dx = 2 117 | 118 | # Second backward pass (gradients accumulate) 119 | y2.backward() 120 | print(f"After y2.backward(), x.grad: {x.grad}") # 2 (from y1) + 3 (from y2) = 5 121 | 122 | # Zeroing gradients 123 | x.grad.zero_() 124 | print(f"After x.grad.zero_(), x.grad: {x.grad}") 125 | ``` 126 | 127 | ## Excluding Tensors from Autograd (`torch.no_grad()`, `detach()`) 128 | 129 | - **`torch.no_grad()`:** A context manager that disables gradient computation within its block. 
Useful for inference (when you don't need gradients) or when modifying model parameters without tracking these changes (e.g., during evaluation). 130 | - **`.detach()`:** Creates a new tensor that shares the same data as the original tensor but is detached from the current computational graph. It won't require gradients, and no operations on it will be tracked. Useful if you need to use a tensor in a computation that shouldn't be part of the gradient calculation, or to copy a tensor without its history. 131 | 132 | ```python 133 | a = torch.tensor([1.0, 2.0], requires_grad=True) 134 | b = a * 2 135 | 136 | with torch.no_grad(): 137 | c = a * 3 # Operation inside no_grad block 138 | print(f"c.requires_grad inside no_grad: {c.requires_grad}") # False 139 | 140 | d = b.detach() # d shares data with b but is detached 141 | print(f"b.requires_grad: {b.requires_grad}") # True 142 | print(f"d.requires_grad: {d.requires_grad}") # False 143 | ``` 144 | 145 | ## Gradients of Non-Scalar Outputs (Vector-Jacobian Product) 146 | 147 | - If `backward()` is called on a tensor `y` that is not a scalar (e.g., a vector or matrix), PyTorch expects a `gradient` argument. This argument should be a tensor of the same shape as `y` and represents the vector `v` in the vector-Jacobian product `v^T * J`. 148 | - **Vector-Jacobian Product:** Autograd is designed to compute Jacobian-vector products efficiently. If `y = f(x)` and `L` is a scalar loss computed from `y` (i.e., `L = g(y)`), then `dL/dx = (dL/dy) * (dy/dx)`. Here, `dL/dy` is the vector `v` you pass to `y.backward(v)`. 149 | - If you just want the full Jacobian matrix, you'd have to call `backward()` multiple times with one-hot vectors for `gradient`, which is inefficient. `torch.autograd.functional.jacobian` can be used for this if needed. 150 | 151 | ```python 152 | x = torch.randn(3, requires_grad=True) 153 | y = x * 2 # y is a vector 154 | # y.backward() # This would raise an error 155 | 156 | # Provide gradient argument for non-scalar output 157 | # This is equivalent to if we had a scalar loss L = sum(y*v) 158 | # and then called L.backward(). The gradient for x would be 2*v. 159 | v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float) 160 | y.backward(gradient=v) 161 | print(f"x.grad after y.backward(v): {x.grad}") # Expected: 2*v = [0.2, 2.0, 0.002] 162 | ``` 163 | 164 | ## Higher-Order Derivatives 165 | 166 | - PyTorch can compute gradients of gradients (and so on). 167 | - **`torch.autograd.grad()`:** A more flexible way to compute gradients. It takes the output tensor(s) and input tensor(s) and returns the gradients of outputs with respect to inputs. 168 | - **`create_graph=True`:** To compute higher-order derivatives, you need to set `create_graph=True` when calling `backward()` or `torch.autograd.grad()`. This tells Autograd to build a computational graph for the backward pass itself, allowing subsequent differentiation. 169 | 170 | ```python 171 | x = torch.tensor(2.0, requires_grad=True) 172 | y = x**3 173 | 174 | # First derivative (dy/dx) 175 | grad_y_x = torch.autograd.grad(outputs=y, inputs=x, create_graph=True)[0] 176 | print(f"dy/dx = 3*x^2 = {grad_y_x}") # 3 * 2^2 = 12 177 | 178 | # Second derivative (d^2y/dx^2) 179 | grad2_y_x2 = torch.autograd.grad(outputs=grad_y_x, inputs=x)[0] 180 | print(f"d^2y/dx^2 = 6*x = {grad2_y_x2}") # 6 * 2 = 12 181 | ``` 182 | 183 | ## In-place Operations and Autograd 184 | 185 | - In-place operations (e.g., `x.add_(1)`, `y.relu_()`) modify tensors directly without creating new ones. This can save memory. 
186 | - **Potential Issues:** Autograd needs the original values of tensors involved in the forward pass to compute gradients correctly during the backward pass. If an in-place operation overwrites a value that's needed, it can lead to errors or incorrect gradients. 187 | - PyTorch will often raise an error if an in-place operation that would cause issues is detected (e.g., modifying a leaf variable or a variable needed by `grad_fn`). 188 | 189 | ## Custom Autograd Functions (`torch.autograd.Function`) 190 | 191 | - For operations not natively supported by PyTorch, or if you want to define a custom gradient computation (e.g., for a layer written in C++ or CUDA, or to implement a non-differentiable function with a surrogate gradient). 192 | - **Subclass `torch.autograd.Function`:** Implement `forward()` and `backward()` as static methods. 193 | - `forward(ctx, input1, input2, ...)`: Performs the operation. `ctx` (context) is used to save tensors or any other objects needed for the backward pass using `ctx.save_for_backward(tensor1, tensor2)`. It must return the output tensor(s). 194 | - `backward(ctx, grad_output1, grad_output2, ...)`: Computes the gradients of the loss with respect to the inputs of the forward function. It receives the gradients of the loss with respect to the outputs of forward (`grad_output`). It must return as many tensors as there were inputs to `forward`, or `None` for inputs that don't need gradients. 195 | 196 | ```python 197 | class MyCustomReLU(torch.autograd.Function): 198 | @staticmethod 199 | def forward(ctx, input_tensor): 200 | # ctx is a context object that can be used to stash information 201 | # for backward computation 202 | ctx.save_for_backward(input_tensor) 203 | return input_tensor.clamp(min=0) 204 | 205 | @staticmethod 206 | def backward(ctx, grad_output): 207 | # We return as many input gradients as there were arguments. 208 | # Gradients of non-Tensor arguments to forward must be None. 209 | input_tensor, = ctx.saved_tensors 210 | grad_input = grad_output.clone() 211 | grad_input[input_tensor < 0] = 0 212 | return grad_input 213 | 214 | # Usage: 215 | my_relu_fn = MyCustomReLU.apply # Get the function to use 216 | x = torch.tensor([-1.0, 2.0, -0.5], requires_grad=True) 217 | y = my_relu_fn(x) 218 | print(f"Custom ReLU Output: {y}") 219 | y.backward(torch.tensor([1.0, 1.0, 1.0])) # Example upstream gradients 220 | print(f"Gradients for x after custom ReLU: {x.grad}") # Expected: [0., 1., 0.] 221 | ``` 222 | 223 | ## Practical Considerations and Tips 224 | - **`tensor.requires_grad`**: Check if a tensor is tracking history. 225 | - **`tensor.is_leaf`**: Check if a tensor is a leaf node in the graph. 226 | - **Memory Usage**: Autograd stores intermediate activations for the backward pass. For very large models or long sequences, this can lead to high memory usage. Techniques like gradient checkpointing can help. 227 | - **`retain_graph=True`**: Use in `backward()` if you need to perform another backward pass from the same part of the graph. Be mindful of memory implications. 228 | 229 | ## Running the Tutorial 230 | 231 | To run the Python script associated with this tutorial: 232 | ```bash 233 | python automatic_differentiation.py 234 | ``` 235 | We recommend you manually create an `automatic_differentiation.ipynb` notebook and copy the code from the Python script into it for an interactive experience. 236 | 237 | ## Prerequisites 238 | - Python 3.7+ 239 | - PyTorch 1.10+ 240 | - NumPy 241 | 242 | ## Related Tutorials 243 | 1. 
[PyTorch Basics](../01_pytorch_basics/README.md) 244 | 2. [Neural Networks Fundamentals](../02_neural_networks_fundamentals/README.md) 245 | 3. [Training Neural Networks](../04_training_neural_networks/README.md) -------------------------------------------------------------------------------- /14_performance_optimization/performance_optimization.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial 14: Performance Optimization 3 | ===================================== 4 | 5 | This tutorial covers comprehensive performance optimization techniques 6 | for PyTorch models, from profiling to advanced optimization strategies. 7 | """ 8 | 9 | import torch 10 | import torch.nn as nn 11 | import torch.nn.functional as F 12 | from torch.utils.data import DataLoader, Dataset 13 | import torchvision 14 | import torchvision.transforms as transforms 15 | import time 16 | import numpy as np 17 | from torch.profiler import profile, record_function, ProfilerActivity 18 | import torch.cuda.amp as amp 19 | from torch.nn.parallel import DataParallel, DistributedDataParallel 20 | import torch.distributed as dist 21 | import torch.multiprocessing as mp 22 | import os 23 | import psutil 24 | import gc 25 | 26 | # Set device 27 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 28 | print(f"Using device: {device}") 29 | print() 30 | 31 | # Example 1: Basic Profiling 32 | print("Example 1: PyTorch Profiler") 33 | print("=" * 50) 34 | 35 | class SimpleModel(nn.Module): 36 | def __init__(self): 37 | super().__init__() 38 | self.conv1 = nn.Conv2d(3, 64, 3, padding=1) 39 | self.conv2 = nn.Conv2d(64, 128, 3, padding=1) 40 | self.fc1 = nn.Linear(128 * 8 * 8, 256) 41 | self.fc2 = nn.Linear(256, 10) 42 | 43 | def forward(self, x): 44 | x = F.relu(self.conv1(x)) 45 | x = F.max_pool2d(x, 2) 46 | x = F.relu(self.conv2(x)) 47 | x = F.max_pool2d(x, 2) 48 | x = x.view(x.size(0), -1) 49 | x = F.relu(self.fc1(x)) 50 | x = self.fc2(x) 51 | return x 52 | 53 | # Profile the model 54 | model = SimpleModel().to(device) 55 | inputs = torch.randn(32, 3, 32, 32).to(device) 56 | 57 | # Use PyTorch profiler 58 | with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], 59 | record_shapes=True, 60 | profile_memory=True, 61 | with_stack=True) as prof: 62 | with record_function("model_inference"): 63 | for _ in range(10): 64 | model(inputs) 65 | 66 | # Print profiler results 67 | print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) 68 | print() 69 | 70 | # Example 2: Memory Optimization 71 | print("Example 2: Memory Optimization") 72 | print("=" * 50) 73 | 74 | def get_memory_usage(): 75 | if torch.cuda.is_available(): 76 | return torch.cuda.memory_allocated() / 1024**2 # MB 77 | else: 78 | return psutil.Process().memory_info().rss / 1024**2 # MB 79 | 80 | # Memory-efficient gradient checkpointing 81 | class CheckpointedModel(nn.Module): 82 | def __init__(self): 83 | super().__init__() 84 | self.layers = nn.ModuleList([ 85 | nn.Sequential( 86 | nn.Linear(1024, 1024), 87 | nn.ReLU(), 88 | nn.Dropout(0.1) 89 | ) for _ in range(10) 90 | ]) 91 | self.final = nn.Linear(1024, 10) 92 | 93 | def forward(self, x): 94 | for layer in self.layers: 95 | # Use checkpoint to trade compute for memory 96 | x = torch.utils.checkpoint.checkpoint(layer, x) 97 | return self.final(x) 98 | 99 | # Compare memory usage 100 | print("Memory usage comparison:") 101 | x = torch.randn(128, 1024).to(device) 102 | 103 | # Without checkpointing 104 | regular_model = 
nn.Sequential(*[ 105 | nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.1)) 106 | for _ in range(10) 107 | ] + [nn.Linear(1024, 10)]).to(device) 108 | 109 | mem_before = get_memory_usage() 110 | y1 = regular_model(x) 111 | loss1 = y1.sum() 112 | loss1.backward() 113 | mem_regular = get_memory_usage() - mem_before 114 | print(f"Regular model: {mem_regular:.2f} MB") 115 | 116 | # With checkpointing 117 | checkpointed_model = CheckpointedModel().to(device) 118 | optimizer = torch.optim.Adam(checkpointed_model.parameters()) 119 | optimizer.zero_grad() 120 | 121 | mem_before = get_memory_usage() 122 | y2 = checkpointed_model(x) 123 | loss2 = y2.sum() 124 | loss2.backward() 125 | mem_checkpoint = get_memory_usage() - mem_before 126 | print(f"Checkpointed model: {mem_checkpoint:.2f} MB") 127 | print(f"Memory saved: {(1 - mem_checkpoint/mem_regular)*100:.1f}%") 128 | print() 129 | 130 | # Example 3: Mixed Precision Training 131 | print("Example 3: Mixed Precision Training") 132 | print("=" * 50) 133 | 134 | # Create a more complex model for mixed precision demo 135 | class MixedPrecisionModel(nn.Module): 136 | def __init__(self): 137 | super().__init__() 138 | self.features = nn.Sequential( 139 | nn.Conv2d(3, 64, 3, padding=1), 140 | nn.BatchNorm2d(64), 141 | nn.ReLU(), 142 | nn.Conv2d(64, 128, 3, padding=1), 143 | nn.BatchNorm2d(128), 144 | nn.ReLU(), 145 | nn.AdaptiveAvgPool2d(1) 146 | ) 147 | self.classifier = nn.Linear(128, 10) 148 | 149 | def forward(self, x): 150 | x = self.features(x) 151 | x = x.view(x.size(0), -1) 152 | x = self.classifier(x) 153 | return x 154 | 155 | # Training with mixed precision 156 | def train_with_amp(model, dataloader, epochs=2): 157 | model = model.to(device) 158 | optimizer = torch.optim.Adam(model.parameters()) 159 | scaler = amp.GradScaler() 160 | 161 | model.train() 162 | total_time = 0 163 | 164 | for epoch in range(epochs): 165 | epoch_start = time.time() 166 | for i, (inputs, targets) in enumerate(dataloader): 167 | if i >= 10: # Limit iterations for demo 168 | break 169 | 170 | inputs, targets = inputs.to(device), targets.to(device) 171 | 172 | optimizer.zero_grad() 173 | 174 | # Mixed precision forward pass 175 | with amp.autocast(): 176 | outputs = model(inputs) 177 | loss = F.cross_entropy(outputs, targets) 178 | 179 | # Scaled backward pass 180 | scaler.scale(loss).backward() 181 | scaler.step(optimizer) 182 | scaler.update() 183 | 184 | epoch_time = time.time() - epoch_start 185 | total_time += epoch_time 186 | 187 | return total_time / epochs 188 | 189 | # Create dummy dataset 190 | class DummyDataset(Dataset): 191 | def __init__(self, size=1000): 192 | self.size = size 193 | 194 | def __len__(self): 195 | return self.size 196 | 197 | def __getitem__(self, idx): 198 | return torch.randn(3, 32, 32), torch.randint(0, 10, (1,)).item() 199 | 200 | dataset = DummyDataset() 201 | dataloader = DataLoader(dataset, batch_size=64, num_workers=0) 202 | 203 | # Compare training times 204 | model_fp32 = MixedPrecisionModel() 205 | model_amp = MixedPrecisionModel() 206 | 207 | print("Training with FP32...") 208 | time_fp32 = train_with_amp(model_fp32, dataloader) 209 | print(f"Average epoch time: {time_fp32:.3f}s") 210 | 211 | print("\nTraining with AMP...") 212 | time_amp = train_with_amp(model_amp, dataloader) 213 | print(f"Average epoch time: {time_amp:.3f}s") 214 | print(f"Speedup: {time_fp32/time_amp:.2f}x") 215 | print() 216 | 217 | # Example 4: Data Loading Optimization 218 | print("Example 4: Data Loading Optimization") 219 | print("=" * 50) 
220 | 221 | # Optimized dataset with caching 222 | class OptimizedDataset(Dataset): 223 | def __init__(self, size=1000, cache_size=100): 224 | self.size = size 225 | self.cache_size = cache_size 226 | self.cache = {} 227 | self.transform = transforms.Compose([ 228 | transforms.RandomHorizontalFlip(), 229 | transforms.RandomCrop(32, padding=4), 230 | transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)) 231 | ]) 232 | 233 | def __len__(self): 234 | return self.size 235 | 236 | def __getitem__(self, idx): 237 | # Simple caching mechanism 238 | if idx in self.cache: 239 | return self.cache[idx] 240 | 241 | # Simulate data loading 242 | image = torch.randn(3, 32, 32) 243 | label = torch.randint(0, 10, (1,)).item() 244 | 245 | # Cache recent items 246 | if len(self.cache) < self.cache_size: 247 | self.cache[idx] = (image, label) 248 | 249 | return image, label 250 | 251 | # Compare data loading performance 252 | def benchmark_dataloader(dataset, num_workers, pin_memory=False): 253 | dataloader = DataLoader( 254 | dataset, 255 | batch_size=128, 256 | num_workers=num_workers, 257 | pin_memory=pin_memory, 258 | persistent_workers=(num_workers > 0) 259 | ) 260 | 261 | start_time = time.time() 262 | for i, (data, target) in enumerate(dataloader): 263 | if i >= 50: # Limit iterations 264 | break 265 | # Simulate processing 266 | data = data.to(device, non_blocking=True) 267 | 268 | total_time = time.time() - start_time 269 | return total_time 270 | 271 | dataset = OptimizedDataset(5000) 272 | 273 | print("Data loading benchmark:") 274 | for num_workers in [0, 2, 4]: 275 | for pin_memory in [False, True]: 276 | time_taken = benchmark_dataloader(dataset, num_workers, pin_memory) 277 | print(f"Workers: {num_workers}, Pin memory: {pin_memory} - Time: {time_taken:.3f}s") 278 | print() 279 | 280 | # Example 5: Model Optimization with TorchScript 281 | print("Example 5: TorchScript Optimization") 282 | print("=" * 50) 283 | 284 | # Create a model for scripting 285 | class ScriptableModel(nn.Module): 286 | def __init__(self): 287 | super().__init__() 288 | self.conv1 = nn.Conv2d(3, 32, 3) 289 | self.conv2 = nn.Conv2d(32, 64, 3) 290 | self.fc = nn.Linear(64 * 6 * 6, 10) 291 | 292 | def forward(self, x): 293 | x = F.relu(self.conv1(x)) 294 | x = F.max_pool2d(x, 2) 295 | x = F.relu(self.conv2(x)) 296 | x = F.max_pool2d(x, 2) 297 | x = torch.flatten(x, 1) 298 | x = self.fc(x) 299 | return x 300 | 301 | # Compare scripted vs regular model 302 | model = ScriptableModel().to(device) 303 | model.eval() 304 | 305 | # Script the model 306 | scripted_model = torch.jit.script(model) 307 | 308 | # Benchmark 309 | x = torch.randn(100, 3, 32, 32).to(device) 310 | 311 | # Regular model 312 | torch.cuda.synchronize() if torch.cuda.is_available() else None 313 | start = time.time() 314 | for _ in range(100): 315 | _ = model(x) 316 | torch.cuda.synchronize() if torch.cuda.is_available() else None 317 | regular_time = time.time() - start 318 | 319 | # Scripted model 320 | torch.cuda.synchronize() if torch.cuda.is_available() else None 321 | start = time.time() 322 | for _ in range(100): 323 | _ = scripted_model(x) 324 | torch.cuda.synchronize() if torch.cuda.is_available() else None 325 | scripted_time = time.time() - start 326 | 327 | print(f"Regular model: {regular_time:.3f}s") 328 | print(f"Scripted model: {scripted_time:.3f}s") 329 | print(f"Speedup: {regular_time/scripted_time:.2f}x") 330 | print() 331 | 332 | # Example 6: Tensor Operations Optimization 333 | print("Example 6: Tensor Operations Optimization") 334 | 
print("=" * 50) 335 | 336 | # Inefficient operations 337 | def inefficient_operation(x): 338 | result = torch.zeros_like(x) 339 | for i in range(x.shape[0]): 340 | for j in range(x.shape[1]): 341 | result[i, j] = x[i, j] * 2 + 1 342 | return result 343 | 344 | # Efficient vectorized operation 345 | def efficient_operation(x): 346 | return x * 2 + 1 347 | 348 | # Benchmark 349 | x = torch.randn(1000, 1000).to(device) 350 | 351 | start = time.time() 352 | _ = inefficient_operation(x) 353 | inefficient_time = time.time() - start 354 | 355 | start = time.time() 356 | _ = efficient_operation(x) 357 | efficient_time = time.time() - start 358 | 359 | print(f"Inefficient operation: {inefficient_time:.4f}s") 360 | print(f"Efficient operation: {efficient_time:.4f}s") 361 | print(f"Speedup: {inefficient_time/efficient_time:.0f}x") 362 | print() 363 | 364 | # Example 7: Memory-Efficient Attention 365 | print("Example 7: Memory-Efficient Attention") 366 | print("=" * 50) 367 | 368 | class EfficientAttention(nn.Module): 369 | def __init__(self, dim, num_heads=8, chunk_size=256): 370 | super().__init__() 371 | self.num_heads = num_heads 372 | self.chunk_size = chunk_size 373 | self.scale = (dim // num_heads) ** -0.5 374 | 375 | self.qkv = nn.Linear(dim, dim * 3) 376 | self.proj = nn.Linear(dim, dim) 377 | 378 | def forward(self, x): 379 | B, N, C = x.shape 380 | qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4) 381 | q, k, v = qkv[0], qkv[1], qkv[2] 382 | 383 | # Chunked attention computation 384 | attn_chunks = [] 385 | for i in range(0, N, self.chunk_size): 386 | end_idx = min(i + self.chunk_size, N) 387 | q_chunk = q[:, :, i:end_idx] 388 | 389 | # Compute attention for this chunk 390 | attn = (q_chunk @ k.transpose(-2, -1)) * self.scale 391 | attn = attn.softmax(dim=-1) 392 | attn_chunk = attn @ v 393 | attn_chunks.append(attn_chunk) 394 | 395 | # Concatenate chunks 396 | x = torch.cat(attn_chunks, dim=2) 397 | x = x.transpose(1, 2).reshape(B, N, C) 398 | x = self.proj(x) 399 | return x 400 | 401 | # Test memory-efficient attention 402 | seq_len = 1024 403 | dim = 512 404 | batch_size = 8 405 | 406 | attention = EfficientAttention(dim).to(device) 407 | x = torch.randn(batch_size, seq_len, dim).to(device) 408 | 409 | mem_before = get_memory_usage() 410 | output = attention(x) 411 | mem_used = get_memory_usage() - mem_before 412 | print(f"Memory used by efficient attention: {mem_used:.2f} MB") 413 | print(f"Output shape: {output.shape}") 414 | print() 415 | 416 | # Example 8: Custom Memory Allocator 417 | print("Example 8: Custom Memory Management") 418 | print("=" * 50) 419 | 420 | class TensorPool: 421 | """Simple tensor pool for reusing allocations""" 422 | def __init__(self): 423 | self.pool = {} 424 | 425 | def get(self, shape, dtype=torch.float32, device='cpu'): 426 | key = (tuple(shape), dtype, device) 427 | if key in self.pool and len(self.pool[key]) > 0: 428 | return self.pool[key].pop() 429 | return torch.empty(shape, dtype=dtype, device=device) 430 | 431 | def release(self, tensor): 432 | key = (tuple(tensor.shape), tensor.dtype, tensor.device) 433 | if key not in self.pool: 434 | self.pool[key] = [] 435 | self.pool[key].append(tensor) 436 | 437 | def clear(self): 438 | self.pool.clear() 439 | 440 | # Example usage 441 | pool = TensorPool() 442 | 443 | # Simulate multiple allocations 444 | print("Using tensor pool:") 445 | tensors = [] 446 | for i in range(5): 447 | t = pool.get((100, 100), device=device) 448 | tensors.append(t) 449 | 450 | # 
Release some tensors back to pool 451 | for t in tensors[:3]: 452 | pool.release(t) 453 | 454 | # Reuse from pool 455 | print(f"Pool size before reuse: {sum(len(v) for v in pool.pool.values())}") 456 | new_tensors = [] 457 | for i in range(3): 458 | t = pool.get((100, 100), device=device) 459 | new_tensors.append(t) 460 | print(f"Pool size after reuse: {sum(len(v) for v in pool.pool.values())}") 461 | print() 462 | 463 | # Best Practices Summary 464 | print("Performance Optimization Best Practices") 465 | print("=" * 50) 466 | print("1. Profile First: Always profile before optimizing") 467 | print("2. Memory Management: Use gradient checkpointing for large models") 468 | print("3. Mixed Precision: Use AMP for faster training") 469 | print("4. Data Loading: Use multiple workers and pin_memory") 470 | print("5. Batch Size: Find optimal batch size for your GPU") 471 | print("6. TorchScript: Script models for production deployment") 472 | print("7. Operator Fusion: Use fused operations when available") 473 | print("8. Distributed Training: Scale across multiple GPUs") 474 | print() 475 | 476 | # Performance Checklist 477 | print("Performance Optimization Checklist") 478 | print("-" * 30) 479 | checklist = [ 480 | "Profile with torch.profiler", 481 | "Enable mixed precision training", 482 | "Optimize data loading pipeline", 483 | "Use gradient checkpointing for memory", 484 | "Apply model quantization", 485 | "Enable CUDNN benchmarking", 486 | "Use TorchScript for inference", 487 | "Implement custom CUDA kernels for bottlenecks", 488 | "Use distributed training for large models", 489 | "Monitor GPU utilization" 490 | ] 491 | 492 | for item in checklist: 493 | print(f"- [ ] {item}") 494 | 495 | print("\nRemember: Premature optimization is the root of all evil!") 496 | print("Always measure and profile before optimizing.") -------------------------------------------------------------------------------- /13_custom_extensions/custom_extensions.py: -------------------------------------------------------------------------------- 1 | """ 2 | Tutorial 13: Custom Extensions (C++ and CUDA) 3 | ============================================ 4 | 5 | This tutorial demonstrates how to create custom C++ and CUDA extensions 6 | for PyTorch to achieve better performance for specialized operations. 7 | """ 8 | 9 | import torch 10 | import torch.nn as nn 11 | import torch.nn.functional as F 12 | import numpy as np 13 | import time 14 | import os 15 | import sys 16 | from torch.utils.cpp_extension import load_inline 17 | 18 | # First, let's understand why we might need custom extensions 19 | print("Why Custom Extensions?") 20 | print("=" * 50) 21 | print("1. Performance: C++/CUDA can be much faster than Python") 22 | print("2. Memory efficiency: Better control over memory allocation") 23 | print("3. Novel operations: Implement operations not available in PyTorch") 24 | print("4. 
Hardware optimization: Leverage specific hardware features")
25 | print()
26 | 
27 | # Example 1: Simple C++ Extension (Inline JIT Compilation)
28 | print("Example 1: Simple C++ Extension")
29 | print("-" * 30)
30 | 
31 | # C++ source code for a custom ReLU implementation
32 | cpp_source = '''
33 | #include <torch/extension.h>
34 | #include <vector>
35 | 
36 | // Forward pass
37 | torch::Tensor custom_relu_forward(torch::Tensor input) {
38 |     auto output = torch::zeros_like(input);
39 |     output = torch::where(input > 0, input, output);
40 |     return output;
41 | }
42 | 
43 | // Backward pass
44 | torch::Tensor custom_relu_backward(torch::Tensor grad_output, torch::Tensor input) {
45 |     auto grad_input = torch::zeros_like(grad_output);
46 |     grad_input = torch::where(input > 0, grad_output, grad_input);
47 |     return grad_input;
48 | }
49 | 
50 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
51 |     m.def("forward", &custom_relu_forward, "Custom ReLU forward");
52 |     m.def("backward", &custom_relu_backward, "Custom ReLU backward");
53 | }
54 | '''
55 | os.makedirs('./cpp_build', exist_ok=True)  # load_inline expects the build directory to exist
56 | # Load the extension
57 | custom_relu_cpp = load_inline(
58 |     name='custom_relu_cpp',
59 |     cpp_sources=[cpp_source],
60 |     # bindings come from the PYBIND11_MODULE block above, so functions= is not needed here
61 |     verbose=True,
62 |     build_directory='./cpp_build'
63 | )
64 | 
65 | # Create a custom autograd Function
66 | class CustomReLUFunction(torch.autograd.Function):
67 |     @staticmethod
68 |     def forward(ctx, input):
69 |         ctx.save_for_backward(input)
70 |         return custom_relu_cpp.forward(input)
71 | 
72 |     @staticmethod
73 |     def backward(ctx, grad_output):
74 |         input, = ctx.saved_tensors
75 |         return custom_relu_cpp.backward(grad_output, input)
76 | 
77 | # Wrap it in a module
78 | class CustomReLU(nn.Module):
79 |     def forward(self, input):
80 |         return CustomReLUFunction.apply(input)
81 | 
82 | # Test the custom ReLU
83 | x = torch.randn(10, 10, requires_grad=True)
84 | custom_relu = CustomReLU()
85 | y = custom_relu(x)
86 | loss = y.sum()
87 | loss.backward()
88 | 
89 | print(f"Input shape: {x.shape}")
90 | print(f"Output shape: {y.shape}")
91 | print(f"Gradient computed: {x.grad is not None}")
92 | print()
93 | 
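# A quick, illustrative way to validate the custom Function is torch.autograd.gradcheck,
# which compares the analytical backward pass against finite differences. gradcheck expects
# double-precision inputs; for random inputs away from ReLU's kink at zero it should pass.
gradcheck_input = torch.randn(8, 8, dtype=torch.double, requires_grad=True)
print(f"gradcheck passed: {torch.autograd.gradcheck(CustomReLUFunction.apply, (gradcheck_input,), eps=1e-6, atol=1e-4)}")
print()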
94 | # Example 2: CUDA Extension for Matrix Operations
95 | print("Example 2: CUDA Extension")
96 | print("-" * 30)
97 | 
98 | # Check if CUDA is available
99 | if torch.cuda.is_available():
100 |     # CUDA kernel source code
101 |     cuda_source = '''
102 | #include <torch/extension.h>
103 | #include <cuda.h>
104 | #include <cuda_runtime.h>
105 | #include <vector>
106 | 
107 | template <typename scalar_t>
108 | __global__ void custom_matmul_kernel(
109 |     const scalar_t* __restrict__ a,
110 |     const scalar_t* __restrict__ b,
111 |     scalar_t* __restrict__ c,
112 |     int m, int n, int k) {
113 | 
114 |     int row = blockIdx.y * blockDim.y + threadIdx.y;
115 |     int col = blockIdx.x * blockDim.x + threadIdx.x;
116 | 
117 |     if (row < m && col < n) {
118 |         scalar_t sum = 0;
119 |         for (int i = 0; i < k; i++) {
120 |             sum += a[row * k + i] * b[i * n + col];
121 |         }
122 |         c[row * n + col] = sum;
123 |     }
124 | }
125 | 
126 | torch::Tensor custom_matmul_cuda(torch::Tensor a, torch::Tensor b) {
127 |     const int m = a.size(0);
128 |     const int k = a.size(1);
129 |     const int n = b.size(1);
130 | 
131 |     auto c = torch::zeros({m, n}, a.options());
132 | 
133 |     const dim3 threads(16, 16);
134 |     const dim3 blocks((n + threads.x - 1) / threads.x,
135 |                       (m + threads.y - 1) / threads.y);
136 | 
137 |     AT_DISPATCH_FLOATING_TYPES(a.scalar_type(), "custom_matmul_cuda", ([&] {
138 |         custom_matmul_kernel<scalar_t><<<blocks, threads>>>(
139 |             a.data_ptr<scalar_t>(),
140 |             b.data_ptr<scalar_t>(),
141 |             c.data_ptr<scalar_t>(),
142 |             m, n, k
143 |         );
144 |     }));
145 | 
146 |     return c;
147 | }
148 | '''
149 | 
150 |     cpp_source_cuda = '''
151 | #include <torch/extension.h>
152 | 
153 | torch::Tensor custom_matmul_cuda(torch::Tensor a, torch::Tensor b);
154 | 
155 | torch::Tensor custom_matmul(torch::Tensor a, torch::Tensor b) {
156 |     // Check inputs
157 |     TORCH_CHECK(a.dim() == 2, "Matrix A must be 2D");
158 |     TORCH_CHECK(b.dim() == 2, "Matrix B must be 2D");
159 |     TORCH_CHECK(a.size(1) == b.size(0), "Matrix dimensions must match for multiplication");
160 | 
161 |     if (a.is_cuda()) {
162 |         return custom_matmul_cuda(a, b);
163 |     } else {
164 |         // CPU implementation
165 |         return torch::matmul(a, b);
166 |     }
167 | }
168 | 
169 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
170 |     m.def("matmul", &custom_matmul, "Custom matrix multiplication");
171 | }
172 | '''
173 | 
174 |     # Note: building this requires nvcc and a proper CUDA toolchain
175 |     print("CUDA extension example (shown for illustration; not compiled by this script)")
176 |     print("In practice, you would compile this with setuptools or torch.utils.cpp_extension")
177 | else:
178 |     print("CUDA not available, skipping CUDA example")
179 | print()
180 | 
181 | # Example 3: Custom Linear Layer with Fused Operations
182 | print("Example 3: Fused Linear Layer")
183 | print("-" * 30)
184 | 
185 | # C++ code for fused linear layer (bias + activation)
186 | fused_cpp_source = '''
187 | #include <torch/extension.h>
188 | #include <vector>
189 | 
190 | torch::Tensor fused_linear_relu_forward(
191 |     torch::Tensor input,
192 |     torch::Tensor weight,
193 |     torch::Tensor bias) {
194 | 
195 |     // Perform linear transformation
196 |     auto output = torch::matmul(input, weight.t());
197 | 
198 |     // Add bias and apply ReLU in one pass
199 |     output = torch::clamp_min(output + bias, 0);
200 | 
201 |     return output;
202 | }
203 | 
204 | std::vector<torch::Tensor> fused_linear_relu_backward(
205 |     torch::Tensor grad_output,
206 |     torch::Tensor input,
207 |     torch::Tensor weight,
208 |     torch::Tensor output) {
209 | 
210 |     // ReLU backward
211 |     auto relu_grad = torch::where(output > 0, grad_output, torch::zeros_like(grad_output));
212 | 
213 |     // Linear backward
214 |     auto grad_input = torch::matmul(relu_grad, weight);
215 |     auto grad_weight = torch::matmul(relu_grad.t(), input);
216 |     auto grad_bias = relu_grad.sum(0);
217 | 
218 |     return {grad_input, grad_weight, grad_bias};
219 | }
220 | 
221 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
222 |     m.def("forward", &fused_linear_relu_forward, "Fused Linear-ReLU forward");
223 |     m.def("backward", &fused_linear_relu_backward, "Fused Linear-ReLU backward");
224 | }
225 | '''
226 | 
227 | # Load the fused operation
228 | fused_linear_relu = load_inline(
229 |     name='fused_linear_relu',
230 |     cpp_sources=[fused_cpp_source],
231 |     # bindings come from the PYBIND11_MODULE block above, so functions= is not needed here
232 |     verbose=True,
233 |     build_directory='./cpp_build'
234 | )
235 | 
236 | class FusedLinearReLUFunction(torch.autograd.Function):
237 |     @staticmethod
238 |     def forward(ctx, input, weight, bias):
239 |         output = fused_linear_relu.forward(input, weight, bias)
240 |         ctx.save_for_backward(input, weight, output)
241 |         return output
242 | 
243 |     @staticmethod
244 |     def backward(ctx, grad_output):
245 |         input, weight, output = ctx.saved_tensors
246 |         grad_input, grad_weight, grad_bias = fused_linear_relu.backward(
247 |             grad_output, input, weight, output
248 |         )
249 |         return grad_input, grad_weight, grad_bias
250 | 
251 | class FusedLinearReLU(nn.Module):
252 |     def __init__(self, in_features, out_features):
253 |         super().__init__()
254 |         self.weight = nn.Parameter(torch.randn(out_features, in_features))
255 |         self.bias = nn.Parameter(torch.zeros(out_features))
256 | 
257 |     def forward(self, input):
258 |         return FusedLinearReLUFunction.apply(input, self.weight, self.bias)
259 | 
260 | # Test the fused layer
261 | fused_layer = FusedLinearReLU(100, 50)
262 | x = torch.randn(32, 100)
263 | y = fused_layer(x)
264 | print(f"Fused layer output shape: {y.shape}")
265 | print()
266 | 
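# Illustrative sanity check: the fused extension should agree with the equivalent
# unfused PyTorch ops (a linear transform followed by ReLU).
reference = F.relu(F.linear(x, fused_layer.weight, fused_layer.bias))
print(f"Matches F.linear + F.relu: {torch.allclose(y, reference, atol=1e-5)}")
print()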
267 | # Example 4: Custom Optimizer in C++
268 | print("Example 4: Custom Optimizer")
269 | print("-" * 30)
270 | 
271 | custom_optimizer_source = '''
272 | #include <torch/extension.h>
273 | #include <vector>
274 | 
275 | void custom_sgd_step(
276 |     torch::Tensor param,
277 |     torch::Tensor grad,
278 |     torch::Tensor momentum_buffer,
279 |     float lr,
280 |     float momentum,
281 |     float weight_decay) {
282 | 
283 |     if (weight_decay != 0) {
284 |         grad = grad + weight_decay * param;
285 |     }
286 | 
287 |     if (momentum != 0) {
288 |         momentum_buffer.mul_(momentum).add_(grad);
289 |         param.add_(momentum_buffer, -lr);
290 |     } else {
291 |         param.add_(grad, -lr);
292 |     }
293 | }
294 | 
295 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
296 |     m.def("step", &custom_sgd_step, "Custom SGD step");
297 | }
298 | '''
299 | 
300 | custom_sgd = load_inline(
301 |     name='custom_sgd',
302 |     cpp_sources=[custom_optimizer_source],
303 |     # bindings come from the PYBIND11_MODULE block above, so functions= is not needed here
304 |     verbose=True,
305 |     build_directory='./cpp_build'
306 | )
307 | 
308 | class CustomSGD:
309 |     def __init__(self, params, lr=0.01, momentum=0.9, weight_decay=0):
310 |         self.params = list(params)
311 |         self.lr = lr
312 |         self.momentum = momentum
313 |         self.weight_decay = weight_decay
314 |         self.momentum_buffers = {}
315 | 
316 |         for p in self.params:
317 |             self.momentum_buffers[p] = torch.zeros_like(p)
318 | 
319 |     def step(self):
320 |         for p in self.params:
321 |             if p.grad is not None:
322 |                 custom_sgd.step(
323 |                     p.data,
324 |                     p.grad.data,
325 |                     self.momentum_buffers[p],
326 |                     self.lr,
327 |                     self.momentum,
328 |                     self.weight_decay
329 |                 )
330 | 
331 |     def zero_grad(self):
332 |         for p in self.params:
333 |             if p.grad is not None:
334 |                 p.grad.zero_()
335 | 
336 | # Example 5: Performance Comparison
337 | print("Example 5: Performance Comparison")
338 | print("-" * 30)
339 | 
340 | def benchmark_operation(name, func, *args, num_runs=1000):
341 |     # Warmup
342 |     for _ in range(10):
343 |         func(*args)
344 | 
345 |     # Benchmark
346 |     if torch.cuda.is_available():
347 |         torch.cuda.synchronize()
348 | 
349 |     start_time = time.time()
350 |     for _ in range(num_runs):
351 |         result = func(*args)
352 | 
353 |     if torch.cuda.is_available():
354 |         torch.cuda.synchronize()
355 | 
356 |     end_time = time.time()
357 |     avg_time = (end_time - start_time) / num_runs * 1000  # Convert to ms
358 | 
359 |     return avg_time, result
360 | 
361 | # Compare custom ReLU with PyTorch ReLU
362 | x = torch.randn(1000, 1000)
363 | pytorch_relu = nn.ReLU()
364 | custom_relu = CustomReLU()
365 | 
366 | pytorch_time, _ = benchmark_operation("PyTorch ReLU", pytorch_relu, x)
367 | custom_time, _ = benchmark_operation("Custom ReLU", custom_relu, x)
368 | 
369 | print(f"PyTorch ReLU: {pytorch_time:.4f} ms")
370 | print(f"Custom ReLU: {custom_time:.4f} ms")
371 | print(f"Speedup: {pytorch_time/custom_time:.2f}x")
372 | print()
373 | 
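# For more robust timings than the hand-rolled helper above, torch.utils.benchmark
# handles warmup, CUDA synchronization, and per-run statistics. A minimal, illustrative use:
from torch.utils import benchmark as torch_benchmark

_timer_ref = torch_benchmark.Timer(stmt="relu(x)", globals={"relu": pytorch_relu, "x": x})
_timer_custom = torch_benchmark.Timer(stmt="relu(x)", globals={"relu": custom_relu, "x": x})
print(f"torch.utils.benchmark -- PyTorch ReLU: {_timer_ref.timeit(100).mean * 1e3:.4f} ms, "
      f"Custom ReLU: {_timer_custom.timeit(100).mean * 1e3:.4f} ms")
print()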
374 | # Example 6: Building Extensions with setuptools
375 | print("Example 6: Building with setuptools")
376 | print("-" * 30)
377 | 
378 | setup_py_content = '''
379 | import torch
380 | from setuptools import setup
381 | from torch.utils import cpp_extension
382 | 
383 | ext_modules = [
384 |     cpp_extension.CppExtension(
385 |         'custom_ops',
386 |         ['custom_ops.cpp'],
387 |         extra_compile_args=['-O3']
388 |     ),
389 | ]
390 | if torch.cuda.is_available():
391 |     ext_modules.append(cpp_extension.CUDAExtension(
392 |         'custom_cuda_ops',
393 |         ['custom_cuda_ops.cpp', 'custom_cuda_ops_kernel.cu'],
394 |         extra_compile_args={'cxx': ['-O3'], 'nvcc': ['-O3', '--use_fast_math']}
395 |     ))
396 | setup(
397 |     name='custom_ops',
398 |     ext_modules=ext_modules,
399 |     cmdclass={'build_ext': cpp_extension.BuildExtension}
400 | )
401 | '''
402 | 
403 | print("Example setup.py for building extensions:")
404 | print(setup_py_content)
405 | print()
406 | 
407 | # Example 7: Memory Management in Extensions
408 | print("Example 7: Memory Management")
409 | print("-" * 30)
410 | 
411 | memory_cpp_source = '''
412 | #include <torch/extension.h>
413 | #include <vector>
414 | 
415 | // Efficient memory pooling example
416 | class MemoryPool {
417 | private:
418 |     std::vector<torch::Tensor> pool;
419 |     std::vector<bool> in_use;
420 | 
421 | public:
422 |     torch::Tensor allocate(std::vector<int64_t> shape, torch::TensorOptions options) {
423 |         // Try to find a suitable tensor in the pool
424 |         for (size_t i = 0; i < pool.size(); i++) {
425 |             if (!in_use[i] && pool[i].sizes() == shape && pool[i].dtype() == options.dtype() && pool[i].device() == options.device()) {
426 |                 in_use[i] = true;
427 |                 return pool[i];
428 |             }
429 |         }
430 | 
431 |         // Allocate new tensor
432 |         auto tensor = torch::empty(shape, options);
433 |         pool.push_back(tensor);
434 |         in_use.push_back(true);
435 |         return tensor;
436 |     }
437 | 
438 |     void release(torch::Tensor tensor) {
439 |         for (size_t i = 0; i < pool.size(); i++) {
440 |             if (pool[i].data_ptr() == tensor.data_ptr()) {
441 |                 in_use[i] = false;
442 |                 break;
443 |             }
444 |         }
445 |     }
446 | };
447 | 
448 | // Global memory pool
449 | MemoryPool global_pool;
450 | 
451 | torch::Tensor pooled_operation(torch::Tensor input) {
452 |     auto shape = input.sizes().vec();
453 |     auto output = global_pool.allocate(shape, input.options());
454 | 
455 |     // Perform operation
456 |     output.copy_(input);
457 |     output.mul_(2.0);
458 | 
459 |     return output;
460 | }
461 | 
462 | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
463 |     m.def("pooled_operation", &pooled_operation, "Operation with memory pooling");
464 | }
465 | '''
466 | 
467 | print("Memory pooling example shown above")
468 | print("This technique can significantly reduce memory allocation overhead")
469 | print()
470 | 
471 | # Best Practices and Tips
472 | print("Best Practices for Custom Extensions")
473 | print("=" * 50)
474 | print("1. Profile First: Ensure the operation is actually a bottleneck")
475 | print("2. Use Existing Ops: Check if PyTorch already has what you need")
476 | print("3. Memory Layout: Ensure tensors are contiguous when needed")
477 | print("4. Error Handling: Use TORCH_CHECK for input validation")
478 | print("5. Gradient Testing: Always verify gradients with gradcheck")
479 | print("6. Documentation: Document tensor shapes and assumptions")
480 | print("7. Platform Support: Test on different platforms and CUDA versions")
481 | print()
482 | 
483 | # Debugging Tips
484 | print("Debugging Custom Extensions")
485 | print("-" * 30)
486 | print("1. Use print statements in C++ (std::cout)")
487 | print("2. Enable verbose mode in load_inline")
488 | print("3. Use cuda-gdb for CUDA kernels")
489 | print("4. Check tensor contiguity with .is_contiguous()")
490 | print("5. Verify shapes and strides match expectations")
491 | print("6. 
Use torch.autograd.gradcheck for gradient verification") 492 | print() 493 | 494 | # Summary 495 | print("Summary") 496 | print("=" * 50) 497 | print("Custom extensions allow you to:") 498 | print("- Achieve better performance for specialized operations") 499 | print("- Implement novel algorithms not available in PyTorch") 500 | print("- Leverage hardware-specific optimizations") 501 | print("- Create memory-efficient implementations") 502 | print("\nRemember: Only use custom extensions when necessary!") 503 | print("PyTorch's built-in operations are highly optimized and sufficient for most use cases.") -------------------------------------------------------------------------------- /12_distributed_training/distributed_training.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ 5 | Distributed Training 6 | 7 | This script demonstrates various distributed training techniques in PyTorch, 8 | including Data Parallel, Distributed Data Parallel, Model Parallel, and FSDP. 9 | """ 10 | 11 | import os 12 | import time 13 | import argparse 14 | import torch 15 | import torch.nn as nn 16 | import torch.nn.functional as F 17 | import torch.distributed as dist 18 | import torch.multiprocessing as mp 19 | from torch.nn.parallel import DataParallel, DistributedDataParallel as DDP 20 | from torch.utils.data import Dataset, DataLoader 21 | from torch.utils.data.distributed import DistributedSampler 22 | import matplotlib.pyplot as plt 23 | import numpy as np 24 | 25 | # Set random seed for reproducibility 26 | torch.manual_seed(42) 27 | np.random.seed(42) 28 | 29 | # ----------------------------------------------------------------------------- 30 | # Section 1: Introduction to Distributed Training 31 | # ----------------------------------------------------------------------------- 32 | 33 | def intro_to_distributed_training(): 34 | """Introduce distributed training concepts.""" 35 | print("\nSection 1: Introduction to Distributed Training") 36 | print("-" * 50) 37 | print("Distributed training enables:") 38 | print(" - Faster training with multiple GPUs/nodes") 39 | print(" - Training larger models that don't fit on single GPU") 40 | print(" - Processing larger batch sizes") 41 | print("\nTypes of parallelism:") 42 | print(" - Data Parallel: Split data, replicate model") 43 | print(" - Model Parallel: Split model across devices") 44 | print(" - Pipeline Parallel: Split model into stages") 45 | print(f"\nCUDA available: {torch.cuda.is_available()}") 46 | print(f"Number of GPUs: {torch.cuda.device_count()}") 47 | 48 | # ----------------------------------------------------------------------------- 49 | # Section 2: Sample Dataset and Model 50 | # ----------------------------------------------------------------------------- 51 | 52 | class SyntheticDataset(Dataset): 53 | """A synthetic dataset for demonstration.""" 54 | def __init__(self, size=10000, input_dim=784, num_classes=10): 55 | self.size = size 56 | self.input_dim = input_dim 57 | self.num_classes = num_classes 58 | 59 | def __len__(self): 60 | return self.size 61 | 62 | def __getitem__(self, idx): 63 | # Generate random data 64 | data = torch.randn(self.input_dim) 65 | label = torch.randint(0, self.num_classes, (1,)).item() 66 | return data, label 67 | 68 | class SimpleNet(nn.Module): 69 | """A simple neural network for demonstration.""" 70 | def __init__(self, input_dim=784, hidden_dim=256, num_classes=10): 71 | super().__init__() 72 | self.fc1 = 
nn.Linear(input_dim, hidden_dim) 73 | self.fc2 = nn.Linear(hidden_dim, hidden_dim) 74 | self.fc3 = nn.Linear(hidden_dim, num_classes) 75 | self.dropout = nn.Dropout(0.2) 76 | 77 | def forward(self, x): 78 | x = F.relu(self.fc1(x)) 79 | x = self.dropout(x) 80 | x = F.relu(self.fc2(x)) 81 | x = self.dropout(x) 82 | x = self.fc3(x) 83 | return x 84 | 85 | # ----------------------------------------------------------------------------- 86 | # Section 3: Data Parallel (DP) 87 | # ----------------------------------------------------------------------------- 88 | 89 | def demonstrate_data_parallel(): 90 | """Demonstrate Data Parallel training.""" 91 | print("\nSection 2: Data Parallel (DP)") 92 | print("-" * 50) 93 | 94 | if torch.cuda.device_count() < 2: 95 | print("Data Parallel requires at least 2 GPUs. Simulating with CPU...") 96 | return 97 | 98 | # Create model and wrap with DataParallel 99 | model = SimpleNet() 100 | model = DataParallel(model) 101 | model = model.cuda() 102 | 103 | # Create dataset and dataloader 104 | dataset = SyntheticDataset(size=1000) 105 | dataloader = DataLoader(dataset, batch_size=64, shuffle=True) 106 | 107 | # Loss and optimizer 108 | criterion = nn.CrossEntropyLoss() 109 | optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 110 | 111 | # Training loop 112 | print("Training with DataParallel...") 113 | start_time = time.time() 114 | 115 | for epoch in range(2): 116 | total_loss = 0 117 | for batch_idx, (data, target) in enumerate(dataloader): 118 | data, target = data.cuda(), target.cuda() 119 | 120 | optimizer.zero_grad() 121 | output = model(data) 122 | loss = criterion(output, target) 123 | loss.backward() 124 | optimizer.step() 125 | 126 | total_loss += loss.item() 127 | 128 | if batch_idx % 5 == 0: 129 | print(f" Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}") 130 | 131 | avg_loss = total_loss / len(dataloader) 132 | print(f"Epoch {epoch} - Average Loss: {avg_loss:.4f}") 133 | 134 | elapsed_time = time.time() - start_time 135 | print(f"Training time: {elapsed_time:.2f} seconds") 136 | 137 | # ----------------------------------------------------------------------------- 138 | # Section 4: Distributed Data Parallel (DDP) 139 | # ----------------------------------------------------------------------------- 140 | 141 | def setup_ddp(rank, world_size): 142 | """Initialize the distributed environment.""" 143 | os.environ['MASTER_ADDR'] = 'localhost' 144 | os.environ['MASTER_PORT'] = '12355' 145 | 146 | # Initialize process group 147 | dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo", 148 | rank=rank, world_size=world_size) 149 | 150 | def cleanup_ddp(): 151 | """Clean up the distributed environment.""" 152 | dist.destroy_process_group() 153 | 154 | def train_ddp(rank, world_size, num_epochs=2): 155 | """Training function for DDP.""" 156 | print(f"\nProcess {rank}: Initializing DDP training...") 157 | setup_ddp(rank, world_size) 158 | 159 | # Create model and move to device 160 | device = torch.device(f'cuda:{rank}' if torch.cuda.is_available() else 'cpu') 161 | model = SimpleNet().to(device) 162 | 163 | # Wrap model with DDP 164 | if torch.cuda.is_available(): 165 | ddp_model = DDP(model, device_ids=[rank]) 166 | else: 167 | ddp_model = DDP(model) 168 | 169 | # Create dataset with DistributedSampler 170 | dataset = SyntheticDataset(size=1000) 171 | sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank) 172 | dataloader = DataLoader(dataset, batch_size=64, sampler=sampler) 173 | 174 | # Loss and optimizer 
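    # Note: each DDP process sees roughly 1/world_size of the data per step, so the
    # effective global batch size grows with the number of processes; a common heuristic
    # (not applied here) is to scale the learning rate with world_size and warm it up.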
175 | criterion = nn.CrossEntropyLoss() 176 | optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.001) 177 | 178 | # Training loop 179 | start_time = time.time() 180 | 181 | for epoch in range(num_epochs): 182 | sampler.set_epoch(epoch) # Important for proper shuffling 183 | total_loss = 0 184 | 185 | for batch_idx, (data, target) in enumerate(dataloader): 186 | data, target = data.to(device), target.to(device) 187 | 188 | optimizer.zero_grad() 189 | output = ddp_model(data) 190 | loss = criterion(output, target) 191 | loss.backward() 192 | optimizer.step() 193 | 194 | total_loss += loss.item() 195 | 196 | if rank == 0 and batch_idx % 5 == 0: 197 | print(f" Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}") 198 | 199 | # Synchronize and compute average loss 200 | avg_loss = total_loss / len(dataloader) 201 | if rank == 0: 202 | print(f"Epoch {epoch} - Average Loss: {avg_loss:.4f}") 203 | 204 | elapsed_time = time.time() - start_time 205 | if rank == 0: 206 | print(f"DDP Training time: {elapsed_time:.2f} seconds") 207 | 208 | cleanup_ddp() 209 | 210 | def demonstrate_ddp(): 211 | """Demonstrate Distributed Data Parallel training.""" 212 | print("\nSection 3: Distributed Data Parallel (DDP)") 213 | print("-" * 50) 214 | 215 | world_size = min(torch.cuda.device_count(), 2) if torch.cuda.is_available() else 2 216 | 217 | if world_size < 2: 218 | print("DDP demonstration requires at least 2 processes.") 219 | print("Simulating with 2 CPU processes...") 220 | 221 | mp.spawn(train_ddp, args=(world_size,), nprocs=world_size, join=True) 222 | 223 | # ----------------------------------------------------------------------------- 224 | # Section 5: Model Parallel 225 | # ----------------------------------------------------------------------------- 226 | 227 | class ModelParallelNet(nn.Module): 228 | """A model split across multiple devices.""" 229 | def __init__(self, input_dim=784, hidden_dim=256, num_classes=10): 230 | super().__init__() 231 | 232 | # Determine devices 233 | self.device1 = torch.device('cuda:0' if torch.cuda.device_count() > 0 else 'cpu') 234 | self.device2 = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cpu') 235 | 236 | # Split model across devices 237 | self.fc1 = nn.Linear(input_dim, hidden_dim).to(self.device1) 238 | self.fc2 = nn.Linear(hidden_dim, hidden_dim).to(self.device2) 239 | self.fc3 = nn.Linear(hidden_dim, num_classes).to(self.device2) 240 | 241 | def forward(self, x): 242 | x = x.to(self.device1) 243 | x = F.relu(self.fc1(x)) 244 | 245 | x = x.to(self.device2) 246 | x = F.relu(self.fc2(x)) 247 | x = self.fc3(x) 248 | 249 | return x 250 | 251 | def demonstrate_model_parallel(): 252 | """Demonstrate Model Parallel training.""" 253 | print("\nSection 4: Model Parallel") 254 | print("-" * 50) 255 | 256 | if torch.cuda.device_count() < 2: 257 | print("Model Parallel requires at least 2 GPUs.") 258 | print("Demonstrating concept with CPU...") 259 | 260 | # Create model parallel network 261 | model = ModelParallelNet() 262 | 263 | # Create small dataset for demonstration 264 | dataset = SyntheticDataset(size=200) 265 | dataloader = DataLoader(dataset, batch_size=32, shuffle=True) 266 | 267 | # Loss and optimizer 268 | device2 = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cpu') 269 | criterion = nn.CrossEntropyLoss() 270 | optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 271 | 272 | # Training loop 273 | print("Training with Model Parallel...") 274 | start_time = time.time() 275 | 276 | for epoch in range(2): 277 | 
total_loss = 0 278 | for batch_idx, (data, target) in enumerate(dataloader): 279 | target = target.to(device2) 280 | 281 | optimizer.zero_grad() 282 | output = model(data) 283 | loss = criterion(output, target) 284 | loss.backward() 285 | optimizer.step() 286 | 287 | total_loss += loss.item() 288 | 289 | if batch_idx % 5 == 0: 290 | print(f" Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}") 291 | 292 | avg_loss = total_loss / len(dataloader) 293 | print(f"Epoch {epoch} - Average Loss: {avg_loss:.4f}") 294 | 295 | elapsed_time = time.time() - start_time 296 | print(f"Training time: {elapsed_time:.2f} seconds") 297 | 298 | # ----------------------------------------------------------------------------- 299 | # Section 6: Pipeline Parallel (Conceptual Demo) 300 | # ----------------------------------------------------------------------------- 301 | 302 | def demonstrate_pipeline_parallel(): 303 | """Demonstrate Pipeline Parallel concepts.""" 304 | print("\nSection 5: Pipeline Parallel") 305 | print("-" * 50) 306 | print("Pipeline Parallelism splits the model into stages and processes") 307 | print("micro-batches in a pipeline fashion to improve GPU utilization.") 308 | print("\nKey concepts:") 309 | print(" - Model is split into sequential stages") 310 | print(" - Each stage processes micro-batches") 311 | print(" - Reduces bubble (idle) time") 312 | print(" - Can be combined with data parallelism") 313 | 314 | # Simple visualization of pipeline scheduling 315 | print("\nPipeline Schedule Visualization:") 316 | print("Time →") 317 | print("GPU0: [F1][F2][F3][F4][B4][B3][B2][B1]") 318 | print("GPU1: [F1][F2][F3][F4][B4][B3][B2][B1]") 319 | print("GPU2: [F1][F2][F3][F4][B4][B3][B2][B1]") 320 | print("GPU3: [F1][F2][F3][F4][B4][B3][B2][B1]") 321 | print("\nF=Forward, B=Backward, Numbers=Micro-batch IDs") 322 | 323 | # ----------------------------------------------------------------------------- 324 | # Section 7: Fully Sharded Data Parallel (FSDP) Demo 325 | # ----------------------------------------------------------------------------- 326 | 327 | def demonstrate_fsdp_concepts(): 328 | """Demonstrate FSDP concepts.""" 329 | print("\nSection 6: Fully Sharded Data Parallel (FSDP)") 330 | print("-" * 50) 331 | print("FSDP enables training of extremely large models by:") 332 | print(" - Sharding model parameters across GPUs") 333 | print(" - Sharding optimizer states") 334 | print(" - Sharding gradients") 335 | print(" - Optional CPU offloading") 336 | print("\nMemory savings example:") 337 | print(" Standard DDP: Each GPU stores full model") 338 | print(" FSDP: Each GPU stores 1/N of model (N = number of GPUs)") 339 | 340 | # Calculate memory savings 341 | model_size_gb = 7 # Example: 7B parameter model 342 | num_gpus = 8 343 | 344 | print(f"\nExample with {model_size_gb}B parameter model on {num_gpus} GPUs:") 345 | print(f" DDP memory per GPU: {model_size_gb} GB") 346 | print(f" FSDP memory per GPU: {model_size_gb/num_gpus:.2f} GB") 347 | print(f" Memory reduction: {(1 - 1/num_gpus)*100:.1f}%") 348 | 349 | # ----------------------------------------------------------------------------- 350 | # Section 8: Performance Comparison 351 | # ----------------------------------------------------------------------------- 352 | 353 | def plot_performance_comparison(): 354 | """Create a performance comparison visualization.""" 355 | print("\nSection 7: Performance Comparison") 356 | print("-" * 50) 357 | 358 | # Simulated performance data 359 | methods = ['Single GPU', 'DP (4 GPUs)', 'DDP (4 GPUs)', 
'FSDP (4 GPUs)'] 360 | throughput = [100, 320, 380, 350] # Images/second 361 | memory_usage = [16, 64, 64, 20] # GB 362 | 363 | fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5)) 364 | 365 | # Throughput comparison 366 | ax1.bar(methods, throughput, color=['blue', 'green', 'orange', 'red']) 367 | ax1.set_ylabel('Throughput (samples/sec)') 368 | ax1.set_title('Training Throughput Comparison') 369 | ax1.set_ylim(0, 400) 370 | 371 | # Memory usage comparison 372 | ax2.bar(methods, memory_usage, color=['blue', 'green', 'orange', 'red']) 373 | ax2.set_ylabel('Memory Usage (GB)') 374 | ax2.set_title('Memory Usage Comparison') 375 | ax2.set_ylim(0, 70) 376 | 377 | plt.tight_layout() 378 | plt.savefig('distributed_training_comparison.png') 379 | print("Performance comparison saved to 'distributed_training_comparison.png'") 380 | 381 | # ----------------------------------------------------------------------------- 382 | # Section 9: Best Practices 383 | # ----------------------------------------------------------------------------- 384 | 385 | def print_best_practices(): 386 | """Print distributed training best practices.""" 387 | print("\nSection 8: Best Practices") 388 | print("-" * 50) 389 | print("1. Data Loading:") 390 | print(" - Use DistributedSampler for DDP") 391 | print(" - Pin memory for GPU training") 392 | print(" - Use multiple workers for data loading") 393 | print("\n2. Gradient Synchronization:") 394 | print(" - Use gradient accumulation for large batches") 395 | print(" - Consider gradient compression for bandwidth") 396 | print("\n3. Checkpointing:") 397 | print(" - Save checkpoints only from rank 0") 398 | print(" - Use torch.save with map_location for loading") 399 | print("\n4. Debugging:") 400 | print(" - Set TORCH_DISTRIBUTED_DEBUG=DETAIL") 401 | print(" - Use torch.distributed.barrier() for synchronization") 402 | print(" - Monitor GPU utilization and memory") 403 | print("\n5. 
Performance:") 404 | print(" - Profile with torch.profiler") 405 | print(" - Use mixed precision training") 406 | print(" - Overlap computation and communication") 407 | 408 | # ----------------------------------------------------------------------------- 409 | # Main Function 410 | # ----------------------------------------------------------------------------- 411 | 412 | def main(): 413 | """Main function to run all demonstrations.""" 414 | parser = argparse.ArgumentParser(description='Distributed Training Tutorial') 415 | parser.add_argument('--distributed', action='store_true', 416 | help='Run distributed training examples') 417 | args = parser.parse_args() 418 | 419 | print("=" * 70) 420 | print("Distributed Training Tutorial") 421 | print("=" * 70) 422 | 423 | # Run demonstrations 424 | intro_to_distributed_training() 425 | 426 | if torch.cuda.device_count() >= 2: 427 | demonstrate_data_parallel() 428 | else: 429 | print("\nSkipping Data Parallel demo (requires 2+ GPUs)") 430 | 431 | if args.distributed: 432 | demonstrate_ddp() 433 | else: 434 | print("\nSkipping DDP demo (use --distributed flag to run)") 435 | 436 | if torch.cuda.device_count() >= 2: 437 | demonstrate_model_parallel() 438 | else: 439 | print("\nSkipping Model Parallel demo (requires 2+ GPUs)") 440 | 441 | demonstrate_pipeline_parallel() 442 | demonstrate_fsdp_concepts() 443 | plot_performance_comparison() 444 | print_best_practices() 445 | 446 | print("\n" + "=" * 70) 447 | print("Tutorial completed!") 448 | print("=" * 70) 449 | 450 | if __name__ == "__main__": 451 | main() -------------------------------------------------------------------------------- /04_training_neural_networks/README.md: -------------------------------------------------------------------------------- 1 | # Training Neural Networks in PyTorch: A Comprehensive Guide 2 | 3 | This tutorial provides an in-depth guide to training neural networks effectively using PyTorch. We will cover everything from the fundamental training loop to advanced techniques for optimization, regularization, and monitoring to help you build robust and high-performing models. 4 | 5 | ## Table of Contents 6 | 1. [Introduction to Neural Network Training](#introduction-to-neural-network-training) 7 | - The Goal: Learning from Data 8 | - Core Components Revisited: Model, Data, Loss, Optimizer 9 | - The Iterative Process: Epochs and Batches 10 | 2. [Preparing Your Data with `Dataset` and `DataLoader`](#preparing-your-data-with-dataset-and-dataloader) 11 | - `torch.utils.data.Dataset` Customization 12 | - `torch.utils.data.DataLoader` for Batching and Shuffling 13 | - Data Augmentation and Transformation 14 | 3. [The Essential Training Loop](#the-essential-training-loop) 15 | - Setting the Model to Training Mode (`model.train()`) 16 | - Iterating Through Data Batches 17 | - Zeroing Gradients (`optimizer.zero_grad()`) 18 | - Forward Pass: Getting Predictions 19 | - Calculating the Loss 20 | - Backward Pass: Computing Gradients (`loss.backward()`) 21 | - Optimizer Step: Updating Weights (`optimizer.step()`) 22 | - Tracking Metrics (Loss, Accuracy) 23 | 4. [Validation: Evaluating Model Performance](#validation-evaluating-model-performance) 24 | - Importance of a Validation Set 25 | - Train-Validation-Test Splits 26 | - Setting the Model to Evaluation Mode (`model.eval()`) 27 | - Disabling Gradient Computation (`torch.no_grad()`) 28 | - Implementing a Validation Loop 29 | - K-Fold Cross-Validation (Concept and Use Case) 30 | 5. 
[Saving and Loading Models](#saving-and-loading-models) 31 | - Saving/Loading Entire Model vs. State Dictionary (`state_dict`) 32 | - Saving `state_dict` (Recommended) 33 | - Loading `state_dict` 34 | - Saving Checkpoints During Training (for Resuming) 35 | 6. [Hyperparameter Tuning Strategies](#hyperparameter-tuning-strategies) 36 | - What are Hyperparameters? 37 | - Common Hyperparameters: Learning Rate, Batch Size, Network Architecture, Regularization Strength 38 | - Manual Search vs. Grid Search vs. Random Search 39 | - Advanced Tools: Optuna, Ray Tune, Weights & Biases Sweeps (Conceptual Overview) 40 | 7. [Learning Rate Scheduling](#learning-rate-scheduling) 41 | - Why Adjust Learning Rate During Training? 42 | - Common Schedulers in `torch.optim.lr_scheduler`: 43 | - `StepLR`: Decay by gamma every step_size epochs. 44 | - `MultiStepLR`: Decay by gamma at specified milestones. 45 | - `ExponentialLR`: Decay by gamma every epoch. 46 | - `CosineAnnealingLR`: Cosine-shaped decay. 47 | - `ReduceLROnPlateau`: Reduce LR when a metric stops improving. 48 | - Integrating Schedulers into the Training Loop 49 | 8. [Regularization Techniques to Prevent Overfitting](#regularization-techniques-to-prevent-overfitting) 50 | - What is Overfitting? 51 | - L1 and L2 Regularization (Weight Decay in Optimizers) 52 | - Dropout (`nn.Dropout`) 53 | - Early Stopping 54 | - Data Augmentation (as a form of regularization) 55 | 9. [Gradient Clipping](#gradient-clipping) 56 | - Problem: Exploding Gradients 57 | - `torch.nn.utils.clip_grad_norm_` 58 | - `torch.nn.utils.clip_grad_value_` 59 | - When and How to Use It 60 | 10. [Weight Initialization Strategies](#weight-initialization-strategies) 61 | - Importance of Proper Initialization 62 | - Common Methods in `torch.nn.init`: 63 | - Xavier/Glorot Initialization (`nn.init.xavier_uniform_`, `nn.init.xavier_normal_`) 64 | - Kaiming/He Initialization (`nn.init.kaiming_uniform_`, `nn.init.kaiming_normal_`) 65 | - Initializing Biases (e.g., to zero or small constants) 66 | - Applying Initialization to a Model 67 | 11. [Batch Normalization (`nn.BatchNorm1d`, `nn.BatchNorm2d`)](#batch-normalization-nnbatchnorm1d-nnbatchnorm2d) 68 | - How it Works: Normalizing Activations within a Batch 69 | - Benefits: Faster Convergence, Regularization Effect, Reduced Sensitivity to Initialization 70 | - Usage: `model.train()` vs. `model.eval()` behavior 71 | 12. [Monitoring Training with TensorBoard](#monitoring-training-with-tensorboard) 72 | - `torch.utils.tensorboard.SummaryWriter` 73 | - Logging Scalars: Loss, Accuracy, Learning Rate 74 | - Logging Histograms: Weights, Gradients 75 | - Logging Images, Model Graphs (Conceptual) 76 | 13. [A Complete Training Pipeline Example](#a-complete-training-pipeline-example) 77 | - Structuring the Code: Setup, Data Loading, Model, Training, Evaluation 78 | - Putting It All Together (Conceptual Flow) 79 | 80 | ## Introduction to Neural Network Training 81 | 82 | - **The Goal: Learning from Data** 83 | The primary objective of training a neural network is to enable it to learn patterns and relationships from a given dataset. This learned knowledge allows the model to make accurate predictions or classifications on new, unseen data. 84 | - **Core Components Revisited:** 85 | - **Model:** The neural network architecture (e.g., an MLP, CNN) defined using `nn.Module`. 86 | - **Data:** Input features and corresponding target labels, typically split into training, validation, and test sets. 
87 | - **Loss Function:** A function that measures the discrepancy between the model's predictions and the true target values (e.g., `nn.CrossEntropyLoss` for classification, `nn.MSELoss` for regression). 88 | - **Optimizer:** An algorithm (e.g., SGD, Adam from `torch.optim`) that adjusts the model's parameters (weights and biases) to minimize the loss function. 89 | - **The Iterative Process: Epochs and Batches** 90 | - **Epoch:** One complete pass through the entire training dataset. 91 | - **Batch:** The training dataset is often divided into smaller subsets called batches. The model's weights are updated after processing each batch. This makes training more computationally manageable and can lead to faster convergence. 92 | 93 | ## Preparing Your Data with `Dataset` and `DataLoader` 94 | 95 | PyTorch provides convenient utilities for handling data. 96 | 97 | - **`torch.utils.data.Dataset`:** An abstract class for representing a dataset. You can create custom datasets by subclassing it and implementing `__len__` (to return the size of the dataset) and `__getitem__` (to support indexing and return a single sample). 98 | - **`torch.utils.data.DataLoader`:** Wraps a `Dataset` and provides an iterable over the dataset. It handles batching, shuffling, and parallel data loading. 99 | - **Data Augmentation and Transformation:** Often applied within the `Dataset` or via `torchvision.transforms` to increase data diversity and improve model generalization. 100 | 101 | ```python 102 | from torch.utils.data import Dataset, DataLoader 103 | import torchvision.transforms as transforms 104 | 105 | class MyCustomDataset(Dataset): 106 | def __init__(self, data, targets, transform=None): 107 | self.data = data 108 | self.targets = targets 109 | self.transform = transform 110 | 111 | def __len__(self): 112 | return len(self.data) 113 | 114 | def __getitem__(self, idx): 115 | sample = self.data[idx] 116 | target = self.targets[idx] 117 | if self.transform: 118 | sample = self.transform(sample) 119 | return sample, target 120 | 121 | # Example usage: 122 | # train_data, train_targets = ... 123 | # train_dataset = MyCustomDataset(train_data, train_targets, transform=transforms.ToTensor()) 124 | # train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True) 125 | ``` 126 | 127 | ## The Essential Training Loop 128 | 129 | The core of neural network training. Here's a breakdown of a typical single epoch: 130 | 131 | ```python 132 | # Assume model, train_loader, criterion, optimizer, device are defined 133 | # model = YourModel().to(device) 134 | # criterion = nn.CrossEntropyLoss() 135 | # optimizer = torch.optim.Adam(model.parameters(), lr=0.001) 136 | 137 | def train_one_epoch(model, train_loader, criterion, optimizer, device): 138 | model.train() # 1. Set model to training mode 139 | running_loss = 0.0 140 | correct_predictions = 0 141 | total_samples = 0 142 | 143 | # 2. Iterate through data batches 144 | for batch_idx, (inputs, targets) in enumerate(train_loader): 145 | inputs, targets = inputs.to(device), targets.to(device) 146 | 147 | # 3. Zeroing gradients 148 | optimizer.zero_grad() 149 | 150 | # 4. Forward pass: Getting predictions 151 | outputs = model(inputs) 152 | 153 | # 5. Calculating the loss 154 | loss = criterion(outputs, targets) 155 | 156 | # 6. Backward pass: Computing gradients 157 | loss.backward() 158 | 159 | # 7. Optimizer step: Updating weights 160 | optimizer.step() 161 | 162 | # 8. 
Tracking metrics 163 | running_loss += loss.item() * inputs.size(0) 164 | _, predicted_classes = outputs.max(1) 165 | total_samples += targets.size(0) 166 | correct_predictions += predicted_classes.eq(targets).sum().item() 167 | 168 | epoch_loss = running_loss / total_samples 169 | epoch_accuracy = correct_predictions / total_samples 170 | return epoch_loss, epoch_accuracy 171 | ``` 172 | 173 | ## Validation: Evaluating Model Performance 174 | 175 | Validation helps monitor overfitting and assess how well the model generalizes to unseen data. 176 | 177 | - **Train-Validation-Test Splits:** 178 | - **Training Set:** Used to train the model. 179 | - **Validation Set:** Used to tune hyperparameters and make decisions about the training process (e.g., early stopping). 180 | - **Test Set:** Used for a final, unbiased evaluation of the trained model. Should only be used once. 181 | - **`model.eval()`:** Sets the model to evaluation mode. This is important for layers like Dropout and BatchNorm, which behave differently during training and evaluation. 182 | - **`torch.no_grad()`:** A context manager that disables gradient computation, reducing memory usage and speeding up inference during validation/testing. 183 | 184 | ```python 185 | # Assume model, val_loader, criterion, device are defined 186 | def validate_one_epoch(model, val_loader, criterion, device): 187 | model.eval() # 1. Set model to evaluation mode 188 | running_loss = 0.0 189 | correct_predictions = 0 190 | total_samples = 0 191 | 192 | with torch.no_grad(): # 2. Disable gradient computation 193 | for inputs, targets in val_loader: 194 | inputs, targets = inputs.to(device), targets.to(device) 195 | outputs = model(inputs) 196 | loss = criterion(outputs, targets) 197 | 198 | running_loss += loss.item() * inputs.size(0) 199 | _, predicted_classes = outputs.max(1) 200 | total_samples += targets.size(0) 201 | correct_predictions += predicted_classes.eq(targets).sum().item() 202 | 203 | epoch_loss = running_loss / total_samples 204 | epoch_accuracy = correct_predictions / total_samples 205 | return epoch_loss, epoch_accuracy 206 | ``` 207 | 208 | - **K-Fold Cross-Validation:** For smaller datasets, split the data into K folds. Train on K-1 folds and validate on the remaining fold. Repeat K times, averaging the performance metrics. Provides a more robust estimate of model performance. 209 | 210 | ## Saving and Loading Models 211 | 212 | It's essential to save your trained model for later use or to resume training. 213 | 214 | - **Saving/Loading `state_dict` (Recommended):** This saves only the model's learnable parameters (weights and biases). 215 | ```python 216 | # Saving 217 | # torch.save(model.state_dict(), 'model_weights.pth') 218 | 219 | # Loading 220 | # model_architecture = YourModel(*args, **kwargs) # Recreate model instance first 221 | # model_architecture.load_state_dict(torch.load('model_weights.pth')) 222 | # model_architecture.to(device) # Don't forget to move to device 223 | # model_architecture.eval() # Set to eval mode if using for inference 224 | ``` 225 | - **Saving Checkpoints:** Save model `state_dict`, optimizer `state_dict`, epoch, loss, etc., to resume training. 
226 | ```python 227 | # checkpoint = { 228 | # 'epoch': epoch, 229 | # 'model_state_dict': model.state_dict(), 230 | # 'optimizer_state_dict': optimizer.state_dict(), 231 | # 'loss': loss, 232 | # # any other metrics 233 | # } 234 | # torch.save(checkpoint, 'checkpoint.pth') 235 | ``` 236 | 237 | ## Hyperparameter Tuning Strategies 238 | 239 | - **Common Hyperparameters:** Learning rate, batch size, number of epochs, optimizer choice, hidden layer sizes, activation functions, dropout rate, weight decay. 240 | - **Manual Search:** Experimenting based on intuition and observation. 241 | - **Grid Search:** Defining a grid of hyperparameter values and trying all combinations. Computationally expensive. 242 | - **Random Search:** Randomly sampling hyperparameter combinations. Often more efficient than grid search. 243 | - **Advanced Tools:** Libraries like Optuna, Ray Tune, or services like Weights & Biases Sweeps automate the search process using more sophisticated algorithms (e.g., Bayesian optimization). 244 | 245 | ## Learning Rate Scheduling 246 | 247 | Dynamically adjusting the learning rate can lead to better performance and faster convergence. 248 | 249 | - **`torch.optim.lr_scheduler`:** Provides various schedulers. 250 | ```python 251 | from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau 252 | 253 | # optimizer = torch.optim.Adam(model.parameters(), lr=0.01) 254 | # scheduler_steplr = StepLR(optimizer, step_size=10, gamma=0.1) # Reduce LR by factor of 0.1 every 10 epochs 255 | # scheduler_plateau = ReduceLROnPlateau(optimizer, 'min', patience=5, factor=0.5) # Reduce if val_loss plateaus 256 | 257 | # In training loop, after optimizer.step(): 258 | # if isinstance(scheduler, ReduceLROnPlateau): 259 | # scheduler.step(validation_loss) # For ReduceLROnPlateau 260 | # else: 261 | # scheduler.step() # For most other schedulers 262 | ``` 263 | 264 | ## Regularization Techniques to Prevent Overfitting 265 | 266 | Overfitting occurs when a model learns the training data too well, including its noise, and performs poorly on unseen data. 267 | 268 | - **L1 and L2 Regularization (Weight Decay):** Add a penalty to the loss function based on the magnitude of model weights. L2 regularization (weight decay) is common and can be added directly in PyTorch optimizers: 269 | `optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)` 270 | - **Dropout (`nn.Dropout`):** Randomly zeros out a fraction of neuron outputs during training, forcing the network to learn more robust features. 271 | - **Early Stopping:** Monitor validation loss and stop training if it doesn't improve for a certain number of epochs. 272 | - **Data Augmentation:** Artificially increasing the size and diversity of the training dataset. 273 | 274 | ## Gradient Clipping 275 | 276 | Helps prevent exploding gradients (gradients becoming very large), which can destabilize training, especially in RNNs. 277 | 278 | - **`torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`:** Clips the L2 norm of all gradients together. 279 | - **`torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)`:** Clips individual gradient values to be within `[-clip_value, clip_value]`. 280 | Call this *after* `loss.backward()` and *before* `optimizer.step()`. 281 | 282 | ## Weight Initialization Strategies 283 | 284 | Proper initialization helps prevent vanishing or exploding gradients and speeds up convergence. 285 | 286 | - **`torch.nn.init`:** Contains various initialization functions. 
287 |   - **Xavier/Glorot:** Good for layers with Sigmoid/Tanh activations. (`nn.init.xavier_uniform_`, `nn.init.xavier_normal_`)
288 |   - **Kaiming/He:** Good for layers with ReLU activations. (`nn.init.kaiming_uniform_`, `nn.init.kaiming_normal_`)
289 | 
290 | ```python
291 | # def initialize_weights(m):
292 | #     if isinstance(m, nn.Linear):
293 | #         nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
294 | #         if m.bias is not None:
295 | #             nn.init.constant_(m.bias, 0)
296 | # model.apply(initialize_weights)
297 | ```
298 | 
299 | ## Batch Normalization (`nn.BatchNorm1d`, `nn.BatchNorm2d`)
300 | 
301 | Normalizes a layer's outputs across the current batch by subtracting the batch mean and dividing by the batch standard deviation. This speeds up training and has a mild regularizing effect.
302 | Remember to use `model.train()` and `model.eval()` appropriately, as BatchNorm layers behave differently in the two modes (batch statistics vs. running statistics).
303 | 
304 | ## Monitoring Training with TensorBoard
305 | 
306 | TensorBoard is a powerful visualization toolkit for inspecting and understanding your model's training process.
307 | 
308 | - **`torch.utils.tensorboard.SummaryWriter`:** The main class for logging data to TensorBoard.
309 | ```python
310 | # from torch.utils.tensorboard import SummaryWriter
311 | # writer = SummaryWriter('runs/my_experiment_name')
312 | # writer.add_scalar('Training Loss', epoch_loss, global_step=epoch)
313 | # writer.add_scalar('Validation Accuracy', epoch_accuracy, global_step=epoch)
314 | # writer.add_histogram('fc1.weights', model.fc1.weight, global_step=epoch)
315 | # writer.close()
316 | ```
317 | 
318 | ## A Complete Training Pipeline Example
319 | 
320 | A full pipeline integrates data loading, model definition, the training loop, validation, schedulers, saving, and monitoring. The accompanying Python script (`training_neural_networks.py`) will provide a concrete example of these components working together.
321 | 
322 | ## Running the Tutorial
323 | 
324 | To run the Python script associated with this tutorial:
325 | ```bash
326 | python training_neural_networks.py
327 | ```
328 | For an interactive experience, you can copy the code from the Python script into a Jupyter notebook (e.g. `training_neural_networks.ipynb`) and run it cell by cell.
329 | 
330 | ## Prerequisites
331 | - Python 3.7+
332 | - PyTorch 2.0+ (matching the repository's `requirements.txt`)
333 | - NumPy
334 | - Matplotlib (for visualization)
335 | - Scikit-learn (optional, for utilities like KFold or datasets)
336 | - TensorBoard (optional, for advanced monitoring: `pip install tensorboard`)
337 | 
338 | ## Related Tutorials
339 | 1. [PyTorch Basics](../01_pytorch_basics/README.md)
340 | 2. [Neural Networks Fundamentals](../02_neural_networks_fundamentals/README.md)
341 | 3. [Automatic Differentiation](../03_automatic_differentiation/README.md)
342 | 4. [Data Loading and Preprocessing](../05_data_loading_preprocessing/README.md)
--------------------------------------------------------------------------------