├── LICENSE ├── README.md ├── compilers.md ├── knowledge_distillation_overview.md ├── knowledge_distillation_resources.md ├── pruning-overview.md ├── pruning-resources.md ├── quantization-overview.md ├── quantization-resources.md └── weekly-insights.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Nebuly 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Exploring AI optimization 2 | This library aims to become a curated list of awesome material (cool papers, tutorials, courses, etc.) on optimization techniques to make artificial intelligence faster and more efficient 🚀 3 | 4 | - [x] [Weekly insights from the best papers on AI](https://github.com/nebuly-ai/learning-AI-optimization/blob/main/weekly-insights.md) 5 | - [x] **Quantization** [[Overview](https://github.com/nebuly-ai/learning-AI-optimization/blob/main/quantization-overview.md), [Resources](https://github.com/nebuly-ai/learning-AI-optimization/blob/main/quantization-resources.md)] 6 | - [x] **Pruning** [[Overview](https://github.com/nebuly-ai/learning-AI-optimization/blob/main/pruning-overview.md), [Resources](https://github.com/nebuly-ai/exploring-AI-optimization/blob/main/pruning-resources.md)] 7 | - [ ] **Distillation** [Overview, Resources] 8 | - [ ] **Deep learning compilers** [Overview, Resources] 9 | 10 | And many others. Do you have any ideas for material to expand the list? We welcome contributions! And don't forget to leave a star if you found this library insightful ⭐ 11 | 12 | 13 | # Contribute 14 | 15 | We welcome support from the open-source community in any form 🥰 16 | 17 | We have recently published this library. We will continue to update and upgrade it with great material on how to accelerate artificial intelligence models and make them ever more efficient. We hope you will also learn with us and help us keep the library populated with the best quality content.
18 | 19 | Open an [issue](https://github.com/nebuly-ai/learning-AI-optimization/issues) or drop a message in the community channel to 20 | - Add topics you would like to be covered 21 | - Ask any questions 22 | - Report a typo 23 | - Propose to include your research work 24 | 25 | 26 | Fork the library, make the changes, and send us a [pull request](https://github.com/nebuly-ai/learning-AI-optimization/pulls) if you want to 27 | - Write new content on AI optimization 28 | - Correct or improve some sections 29 | 30 | 31 | Currently, the main contributor to the library is [Pier](https://www.linkedin.com/in/pierpaolo-sorbellini-64603012b/), and the library would not have been possible without [Diego](https://www.linkedin.com/in/diego-fiori-b64b3016a/)'s insights and topic suggestions, and the support of the community. 32 | -------------------------------------------------------------------------------- /compilers.md: -------------------------------------------------------------------------------- 1 |

2 | Join the community • 3 | Contribute to the library 4 |

5 | 6 | 7 | 8 | 9 | # Compilers 10 | 11 | Find below an overview of deep learning compilers, while [here]() you can find a curated collection of resources on the topic. 12 | 13 | Deep learning compilers are custom compilers used to compile deep learning models. 14 | 15 | To compile means to convert high-level code into low-level code containing machine-level instructions that can be used directly by the given hardware platform. 16 | 17 | One of the main advantages is trivial: deep learning models are usually Python-based, meaning that the code has to run on top of the Python interpreter, which is flexible but slow. 18 | 19 | Libraries such as PyTorch, NumPy and many others rely on a structure composed of two pieces: 20 | 21 | - A compiled C++ library that contains all the compute-intensive functions (i.e. the actual algebraic operations). 22 | - A Python library, bound to the C++ library, that allows the compiled C++ functions to be called from Python. 23 | 24 | With this strategy the overhead of the Python interpreter can be limited, but not eliminated, merging efficient compiled C++ code with the flexibility and ease of use of Python. 25 | 26 | Moreover, it is possible to link compiled CUDA kernels to exploit accelerators such as GPUs. 27 | 28 | Deep learning compilers instead take your Python model as input and compile it as a whole, returning an object that can be treated as a function. This makes a compiled model generally faster as is, simply because it does not rely on the Python interpreter. 29 | 30 | Furthermore, specialized deep learning compilers include a set of optimizations, custom-made for DL, designed to enhance the efficiency of a model. These features will be the focus of the next paragraphs, starting with the main components of a deep learning compiler and following with the most common optimizations that are built on top of them. 31 | 32 | The main deep learning compilers are: 33 | 34 | - TVM (Apache Open Source) 35 | - Glow / Tensor Comprehensions (Meta) 36 | - XLA (Google) 37 | - ONNX Runtime 38 | - OpenVINO (Intel) 39 | - TensorRT (NVIDIA) 40 | 41 | ## Compiler's structure 42 | 43 | At a high level, the main building block of a compiler is the IR, which stands for Intermediate Representation. Basically, the IR is a way to abstract the code to be compiled, so that some features of the code can be analyzed and optimized. To give an example, the first intermediate representation of a deep learning compiler is usually the model graph: representing the model as a graph of algebraic operations allows dependencies between operations, data flow, and sometimes control flow to be highlighted. 44 | 45 | The compiler is essentially made up of many overlapping IRs. The IRs are hierarchical: the higher-level IRs usually affect only the model, while the lower-level IRs usually also take into account the hardware platform on which the code is to run. IRs are pipelined: the model is first fed to the higher-level IRs and then gradually converted to lower-level IRs. 46 | 47 | Where are the optimizations performed? Everywhere! Each IR is designed to better highlight some optimization opportunity, and each IR has its own "passes" that modify the current representation of the model, increasing the performance of the original model.
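As a toy illustration of what a single graph-level pass can do, here is a minimal, self-contained sketch of constant folding on a hypothetical expression-graph IR (the `Const`/`Var`/`Op` node types are invented for this example and are not the API of any real compiler):

```python
# Minimal sketch of a graph-level "constant folding" pass on a toy IR.
# Real DL compilers (TVM, XLA, Glow, ...) have far richer IRs, but the idea of
# a pass that rewrites the graph into a cheaper, equivalent one is the same.
from dataclasses import dataclass
from typing import Union

@dataclass
class Const:
    value: float

@dataclass
class Var:
    name: str

@dataclass
class Op:
    kind: str        # "add" or "mul"
    lhs: "Expr"
    rhs: "Expr"

Expr = Union[Const, Var, Op]

def fold_constants(node: Expr) -> Expr:
    """One pass: rewrite sub-graphs made only of constants into a single Const node."""
    if isinstance(node, (Const, Var)):
        return node
    lhs, rhs = fold_constants(node.lhs), fold_constants(node.rhs)
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        return Const(lhs.value + rhs.value if node.kind == "add" else lhs.value * rhs.value)
    return Op(node.kind, lhs, rhs)

# (2 * 3) + x  becomes  6 + x: the constant subtree is evaluated at compile time.
graph = Op("add", Op("mul", Const(2.0), Const(3.0)), Var("x"))
print(fold_constants(graph))  # Op(kind='add', lhs=Const(value=6.0), rhs=Var(name='x'))
```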
To make this concept tangible, let us imagine the graph of our model in the first high-level IR: this graph can be modified by moving or joining nodes, removing unused branches, and so on, and all of these graph optimizations are usually passes of the first intermediate representation. 48 | 49 | ## High-level IRs 50 | 51 | High-level IRs must represent: 52 | 53 | - tensor computation 54 | - data (input, weights, and intermediate data). 55 | 56 | The design challenge of high-level IRs is the abstraction capability of computation and control flow, which must capture and express different DL models. The goal of high-level IRs is to establish the control flow and the dependencies between operators and data, as well as to provide an interface for graph-level optimizations. 57 | 58 | High-level IRs are: 59 | 60 | - IRs based on Data Dependence Graphs (DDG): used in generic compilers to perform optimizations such as: 61 | - common subexpression elimination (CSE) 62 | - dead code elimination (DCE) 63 | - IRs based on directed acyclic graphs (DAG): the nodes of a DAG represent atomic DL operators (convolution, pooling, etc.) and the edges represent tensors. It is used to analyze the relationships and dependencies among the various operators and use them to guide optimizations. 64 | - Let-binding-based IRs: let-binding is a method to resolve semantic ambiguity by offering let expressions. A let node is generated, which points to the operator and the variable in the expression, instead of simply constructing a computational relationship between the variables as a DAG does. Among DL compilers, TVM's Relay IR adopts both DAG-based IR and let-binding-based IR. 65 | 66 | ### Representation of tensors 67 | 68 | Different graph IRs have different ways of representing computation on tensors. 69 | 70 | - Function-based (Glow, XLA, OpenVINO): the function-based representation provides only encapsulated operators. Instructions are organized in three levels: 71 | - the entire program 72 | - a function 73 | - the operation. 74 | - Lambda expression (TVM): describes computation through binding and substitution of variables. 75 | - Einstein notation (Glow): indices of temporary variables need not be defined. The IR can figure out the actual expression from the occurrence of undefined variables, based on Einstein notation. Operators must be associative and commutative, making further parallelization possible. 76 | 77 | ### Data representation 78 | 79 | DL compilers can represent tensor data directly with memory pointers or, more flexibly, with placeholders. 80 | 81 | - Placeholder: a placeholder is simply a variable with explicit shape information (size in each dimension), which will be populated with values at a later stage of the computation. It allows programmers to describe operations and construct the computation graph without having to indicate the exact data elements, and it allows the shape of inputs/outputs and of the corresponding intermediate data to be changed without changing the definition of the computation. 82 | - Unknown (dynamic) shape representation: the unknown dimension is usually supported when declaring placeholders. The unknown shape representation is needed to support dynamic models. Constraint inference and dimension checking need to be relaxed and, in addition, an extra mechanism is needed to guarantee memory validity. 83 | - Data layout: the data layout describes the organization of a tensor in memory and is usually a mapping from logical indices to memory indices.
The data layout usually includes the sequence of dimensions. Combining data layout information with operators rather than tensors allows an intuitive implementation of some operators and reduces compilation overhead. 84 | - Boundary inference: boundary inference is applied to determine the bounds of iterators. It is usually performed recursively or iteratively, based on the computation graph and the known placeholders. Once the bound of the root iterator is determined from the shapes of the placeholders, the other iterators can be inferred from the recursive relations. 85 | 86 | ### Types of operations 87 | 88 | In addition to the standard algebraic operators in common use, there are others that are implemented in DL frameworks: 89 | 90 | - Broadcast: broadcast operators can replicate data and generate new data with a compatible shape. Without broadcast operators, the shapes of the input tensors are more constrained. 91 | - Control flow: control flow is required when representing complex and flexible models. Models with recurrent relations and data-dependent conditional execution require control flow. 92 | - Derivation: the derivation operator of an operator Op takes as input the output gradients and the input data of Op, then computes the gradient of Op. Programmers can use these derivation operators to construct the derivatives of custom operators. In particular, DL compilers that cannot support derivation operators fail to provide model training capability. 93 | - Custom operators: allow programmers to define their own operators for a particular purpose. 94 | 95 | 96 | ## Front-End Optimizations 97 | 98 | ### **Node-level optimizations:** computation graph nodes are coarse enough to allow optimizations within a single node. 99 | 100 | - **Nop Elimination**: removes no-op instructions that take up a small amount of space but do not specify any operation. 101 | - **Zero-Dim-Tensor Elimination**: is responsible for removing unnecessary operations whose inputs are zero-dimensional tensors. 102 | 103 | ### **Block Level optimizations** 104 | 105 | - **Algebraic simplification**: these optimizations consider a sequence of nodes and exploit the commutativity, associativity and distributivity of different types of nodes to simplify the computation. They can also be applied to DL-specific operators (e.g., reshape, transpose, and pooling). Operators can be reordered and sometimes eliminated. Algebraic simplification consists of: 106 | - Algebraic identities 107 | - Strength reduction: by which we can replace more expensive operators with cheaper ones; 108 | - Constant folding: by which we can replace constant expressions with their values. 109 | 110 | The most common cases in which algebraic simplification can be applied are: 111 | 112 | - Computation order optimization: the optimization finds and removes reshape/transpose operations based on specific characteristics. 113 | - Node combination optimization: the optimization combines multiple consecutive transpose nodes into one node, removes identity transpose nodes, and turns transpose nodes into reshape nodes when they do not actually move data. 114 | - ReduceMean node optimization: the optimization replaces ReduceMean with an AvgPool node if the input of the reduce operator is 4D and the last two dimensions are the ones being reduced. 115 | - **Operator merging** (fusion): it enables better computation sharing, eliminates intermediate allocations, facilitates further optimization by combining nested loops, and reduces launch and synchronization overhead.
116 | - **Sinking operators**: this optimization sinks operations such as transpositions below operations such as batch normalization, ReLU, sigmoid and channel shuffle. Through this optimization, many similar operations are brought closer to each other, creating more opportunities for algebraic simplification. 117 | 118 | ### **Dataflow Level Optimizations:** 119 | 120 | - **Common subexpression elimination (CSE)**: an expression E is a common subexpression if the value of E has been previously calculated and has not changed since the previous calculation. In this case, the value of E is calculated only once and the already computed value can be reused at the other points, avoiding recomputation. 121 | - **Dead Code Elimination (DCE):** a piece of code is dead if its computed results or side effects are not used. DCE optimization removes dead code. Dead code is usually not written by programmers, but produced by other graph optimizations. Other optimizations, such as dead store elimination (DSE), which removes stores into tensors that will never be used, also belong to DCE. 122 | - **Static memory planning**: is performed to reuse memory buffers as much as possible. Typically, there are two approaches: 123 | - In-Place Memory Sharing: uses the same memory for the input and output of an operation, merely allocating a copy of the memory before the computation is executed. 124 | - Standard Memory Sharing: reuses the memory of previous operations without overlapping them. 125 | - **Layout transformation**: layout transformation tries to find the best data layouts in which to store the tensors of the computation graph and then inserts layout transformation nodes into the graph. Note that the actual transformation is not performed here, but during the evaluation of the computation graph by the compiler backend. Indeed, the performance of the same operation differs across data layouts, and the best layout differs across hardware. Some DL compilers rely on hardware-specific libraries to achieve higher performance, and these libraries may require certain layouts. Not only does the layout of tensor data have a nontrivial influence on the final performance, but the transformation operations themselves also have significant overhead, because they too consume memory and computational resources. 126 | 127 | ## Low-level IR 128 | 129 | The low-level IR is designed for hardware-specific optimization and code generation on different hardware targets. Therefore, the low-level IR must be fine-grained enough to reflect hardware characteristics and represent hardware-specific optimizations. 130 | 131 | - Halide-based IR (TVM, Glow). The fundamental philosophy of Halide is the separation of computation and scheduling. Rather than directly providing a specific schedule, compilers adopting Halide try several possible schedules and choose the best one. The boundaries of memory references and loop nests in Halide are restricted to axis-aligned bounded boxes. Therefore, Halide cannot express computation with complicated patterns. On the other hand, Halide can easily parameterize these boundaries and expose them to the tuning mechanism. Halide's original IR must be modified when applied to the backend of DL compilers. 132 | - IR based on polyhedral models (Glow, OpenVINO). The polyhedral model is an important technique adopted in DL compilers.
It uses linear programming, affine transformations, and other mathematical methods to optimize loop-based code with static control flow in its bounds and branches. Unlike Halide, in the polyhedral model the boundaries of memory references and loop nests can be polyhedra of any shape. However, such flexibility also makes integration with tuning mechanisms harder. Polyhedron-based IR facilitates the application of various polyhedral transformations (e.g., fusion, tiling, sinking, and mapping), including both device-dependent and device-independent optimizations. 133 | - Other unique IRs. There are DL compilers that implement custom low-level IRs without using Halide or the polyhedral model. On these custom low-level IRs, they apply hardware-specific optimizations and lowerings to LLVM IR. 134 | 135 | The low-level IR adopted by most DL compilers can eventually be lowered to LLVM's IR and benefit from LLVM's mature optimizer and code generator. In addition, LLVM makes it possible to design custom instruction sets for specialized accelerators from scratch. However, traditional compilers may generate poor code when the model is passed directly to LLVM IR. To avoid this situation, DL compilers apply two approaches to achieve hardware-dependent optimization: 136 | 137 | - perform target-specific loop transformations in the upper-level IRs above LLVM (Halide-based and polyhedral IRs, e.g., Glow, TVM, XLA, OpenVINO) 138 | - provide additional information about the hardware target to the optimization passes (e.g., Glow). Most DL compilers apply both approaches, but with a different emphasis. The compilation scheme of DL compilers can be classified mainly into two categories: 139 | - **just-in-time (JIT):** JIT compilers can generate executable code on the fly and can optimize it with better runtime knowledge. 140 | - **ahead-of-time (AOT):** AOT compilers generate all the executable binaries first and then execute them. Therefore, they have a wider scope for static analysis than JIT compilation. In addition, AOT approaches can be used for cross-compilation to embedded platforms and allow execution on remote machines and custom accelerators. 141 | 142 | ## Back-End Optimizations 143 | 144 | The backend transforms the high-level IR into the low-level IR and performs hardware-specific optimizations. It can use generic third-party toolchains, such as LLVM IR, or custom kernels, to leverage prior knowledge of DL models and hardware and generate code more efficiently. Such optimizations are: 145 | 146 | - **Hardware Intrinsic Mapping**: transforms a certain set of low-level IR instructions into kernels that are already highly optimized on the target hardware. 147 | - **Memory Allocation and Retrieval**: the GPU mainly contains a shared memory space (lower access latency with limited size) and a local memory space (higher access latency with high capacity). This memory hierarchy requires efficient memory allocation and fetching techniques to improve data locality. 148 | - **Hiding memory latency**: is achieved in the backend by reordering the execution pipeline. Since most DL compilers support parallelization on CPU and GPU, memory latency hiding can be naturally achieved by the hardware (e.g., warp context switching on the GPU). But for TPU-like accelerators with a decoupled access-execute (DAE) architecture, the backend must perform fine-grained scheduling and synchronization to obtain correct and efficient code.
149 | - **Parallelization**: since modern processors generally support multi-threading and SIMD parallelism, the compiler backend must exploit parallelism to maximize hardware utilization and achieve high performance. 150 | - **Loop-oriented optimizations:** 151 | - **Loop Fusion**: a loop optimization technique that merges loops with the same bounds for better data reuse. 152 | - **Sliding Windows**: its core concept is to compute values when they are needed and store them on the fly, reusing the data until it is no longer needed. Since sliding windows interleaves the computation of two loops and makes them serial, it is a compromise between parallelism and data reuse. 153 | - **Tiling**: splits loops into tiles, so that loops are divided into outer loops that iterate across tiles and inner loops that iterate within a tile. This transformation allows for better locality of the data within a tile, since a tile can be placed in the hardware caches. Because the size of a tile is hardware-specific, many DL compilers determine the tile pattern and size by auto-tuning (a schematic sketch of tiling is shown after this list). 154 | - **Loop reordering**: changes the order of iterations in a nested loop, which can optimize memory access and thus increase spatial locality. It is specific to the data layout and the hardware characteristics. However, it is not safe to reorder loops when there are dependencies along the iteration order. 155 | - **Loop Unrolling**: unrolls a specific loop into a fixed number of copies of the loop body, which allows compilers to apply aggressive instruction-level parallelism. Usually, loop unrolling is applied in combination with loop splitting, which first splits the loop into two nested loops and then completely unrolls the inner loop. 156 | 157 | 158 | 159 | 160 | 161 |
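To give a concrete feel for what tiling does, here is a minimal NumPy sketch of a tiled matrix multiplication (the tile size of 32 is an arbitrary illustrative choice; real DL compilers pick tile patterns and sizes per target, typically by auto-tuning):

```python
# Schematic illustration of loop tiling: the i/j/k loops are split into outer
# loops that iterate across tiles and inner loops that work within a tile, so
# that each tile of A, B and C can stay resident in cache while it is reused.
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 32) -> np.ndarray:
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):              # outer loops: iterate across tiles
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # inner loops: work within a tile (vectorized here by NumPy)
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)  # same result, different loop structure
```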

162 | Join the community • 163 | Contribute to the library 164 |

165 | -------------------------------------------------------------------------------- /knowledge_distillation_overview.md: -------------------------------------------------------------------------------- 1 |

2 | Join the community • 3 | Contribute to the library 4 |

5 | 6 | 7 | 8 | 9 | 10 | # Overview 11 | 12 | Here we will explore: 13 | - Fundamentals 14 | - Papers, blogs and other resources 15 | 16 | ## What is knowledge distillation? 17 | 18 | xxx 19 | 20 | 21 | ### Key takeaways 22 | 23 | xxxx 24 | -------------------------------------------------------------------------------- /knowledge_distillation_resources.md: -------------------------------------------------------------------------------- 1 |

2 | Join the community • 3 | Contribute to the library 4 |

5 | 6 | 7 | 8 | 9 | 10 | # Resources 11 | 12 | Awesome resource collection on knowledge distillation techniques. 13 | 14 | - Literature reviews and papers 15 | - Courses, webinars and blogs 16 | 17 | If you haven't yet, check the [Overview page on knowledge distillation](https://github.com/nebuly-ai/exploring-AI-optimization/blob/main/knowledge_distillation_overview.md) or [Exploring AI optimization](https://github.com/nebuly-ai/exploring-AI-optimization) to start getting into it. 18 | 19 | 20 | 21 | ## Literature reviews and papers 22 | Legend: 23 | - ✏️ 100-499 citations, ✏️✏️ $\geq$ 500 citations 24 | - ⭐ 100-249 stars, ⭐⭐ $\geq$ 250 stars 25 | 26 | Sorting: chronological/alphabetical order 27 |
28 | 29 | ### Literature reviews 30 | 31 | -title+link 32 | 33 | ### Papers 34 | 2022 35 | - title xxx 36 | 37 | 38 | 2021 39 | - - title xxx 40 | 41 | 2020 42 | - xxx 43 | 44 | ## Courses, webinars and blogs 45 | 46 | Webinars, video content 47 | - [xxxx 48 | 49 | Blogs, written content 50 | - xxxx link 51 | 52 | 53 | 54 | 55 |

56 | Join the community • 57 | Contribute to the library 58 |

59 | -------------------------------------------------------------------------------- /pruning-overview.md: -------------------------------------------------------------------------------- 1 |

2 | Join the community • 3 | Contribute to the library 4 |

5 | 6 | 7 | 8 | 9 | # Pruning 10 | Find below an overview of pruning techniques, while [here](https://github.com/nebuly-ai/exploring-AI-optimization/blob/main/pruning-resources.md) you can find a curated collection of resources on the topic. 11 | 12 | A pruning algorithm can usually be divided into three main parts: 13 | * The selection of the parameters to prune. 14 | * The methodology used to perform the pruning. 15 | * The fine-tuning of the remaining parameters. 16 | 17 | ## Pruning: a Way to Find the "Lottery Ticket" 18 | Given a network, the lottery ticket hypothesis suggests that there is a subnetwork that is at least as accurate as the original network. Pruning, which is a technique for removing weights from a network without affecting its accuracy, aims to find this lottery ticket. 19 | The main advantages of a pruned network are: 20 | * The model is smaller: it has fewer weights and therefore a smaller memory footprint. 21 | * The model is faster: having fewer weights can reduce the number of FLOPs of the model. 22 | * The training of the model is faster. 23 | 24 | These "winning tickets", or subnetworks, are found to have special properties: 25 | * They are independent of the optimizer. 26 | * They are transferable between similar tasks. 27 | * The winning tickets for large-scale tasks are more transferable. 28 | 29 | 30 | ## Pruning Types 31 | Various [pruning techniques](#pruning-techniques) have been proposed in the literature. Here is an attempt to categorize their common traits, namely [scale](#scale-structured-vs-unstructured), [data dependency](#data-dependency-data-dependent-vs-data-independent), [granularity](#granularity-one-shot-vs-iterative), [initialization](#initialization-random-later-iteration-fine-tuning), [reversibility](#reversibility-masked-vs-unmasked), [element type](#element-type-neuron-vs-connection), and [timing](#timing-dynamic-vs-static). 32 | 33 | ```mermaid 34 | flowchart LR 35 | A["Pruning \n Types"] --> B["Scale"] 36 | B --> BA["Structured"] 37 | B --> BB["Unstructured"] 38 | A --> C["Data \n Dependency"] 39 | C --> CA["Data \n Dependent"] 40 | C --> CB["Data \n Independent"] 41 | A --> D["Granularity"] 42 | D --> DA["One-Shot"] 43 | D --> DB["Iterative"] 44 | A --> E["Initialization"] 45 | E --> EA["Random"] 46 | E --> EB["Later Iteration"] 47 | E --> EC["Fine-Tuning"] 48 | A --> F["Reversibility"] 49 | F --> FA["Masked"] 50 | F --> FB["Unmasked"] 51 | A --> G["Element Type"] 52 | G --> GA["Neuron"] 53 | G --> GB["Connection"] 54 | A --> H["Timing"] 55 | H --> HA["Static"] 56 | H --> HB["Dynamic"] 57 | ``` 58 | 59 | ### Scale: Structured Vs Unstructured 60 | Unstructured pruning occurs when the weights to be pruned are targeted individually, without taking the layer structure into account. This means that, once the pruning criterion is defined, the selection of the weights to be eliminated is fairly simple. 61 | Since the layer structure is not taken into account, this type of pruning may not improve the performance of the model. The typical case is that, as the number of pruned weights increases, the weight matrix becomes more and more sparse; sparsity requires ad hoc computational techniques that may even produce worse results if the tradeoff between representation overhead and the amount of computation performed is not balanced. For this reason, the performance increase with this type of pruning is usually only observable for a high pruning ratio.
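To make this concrete, the snippet below is a minimal sketch of unstructured, magnitude-based pruning using PyTorch's `torch.nn.utils.prune` utilities (the layer sizes and the 30% pruning ratio are arbitrary, illustrative choices):

```python
# Minimal sketch of unstructured magnitude pruning in PyTorch.
# Each weight is considered individually: the 30% with the smallest absolute
# value are masked out, regardless of the layer structure.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        # L1 (magnitude) criterion applied weight-by-weight -> unstructured sparsity
        prune.l1_unstructured(module, name="weight", amount=0.3)

# The mask is kept alongside the original weights, so pruning stays reversible
# until prune.remove() folds the mask into the tensor and makes it permanent.
sparsity = (model[0].weight == 0).float().mean().item()
print(f"sparsity of first layer: {sparsity:.2%}")
prune.remove(model[0], "weight")
```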
62 | Structured pruning, instead of focusing on individual weights, attempts to prune entire structures, producing a more ordered sparsity that is computationally easier to handle, at the cost of a less simple pruning algorithm. 63 | 64 | ### Data Dependency: Data Dependent Vs Data Independent 65 | This is the distinction between pruning techniques that use only the weights as information for the pruning algorithm (data independent) and techniques that perform further analyses requiring some input data to be provided (data dependent). Usually data-dependent techniques, because they require more calculations, are more costly in terms of time and resources. 66 | 67 | ### Granularity: One-Shot Vs Iterative 68 | One-shot techniques establish a criterion, such as the amount of weight to prune or the target model compression, and perform pruning in a single pass. Iterative techniques, on the other hand, adapt their learning and pruning ratio over several training epochs, usually producing much better results in terms of both the compression achieved and the limited accuracy degradation. 69 | 70 | ### Initialization: Random, Later Iteration, Fine-Tuning 71 | Especially when performing iterative pruning with different pruning ratios, there are several ways to set the initial weights between pruning stages. They can be set randomly each time, or kept from the previous epoch, fine-tuning the remaining weights to compensate for those that have been pruned. Another technique is to take a later iteration as the starting point, a trade-off between a random initialization and keeping the same weights from the previous epoch; in this way, the model has more freedom to adapt to the pruned weights and generally adapts better to the change. 72 | 73 | ### Reversibility: Masked Vs Unmasked 74 | One of the problems with pruning is that some of the weights removed in the first iterations may actually be critical, and their saliency may become more evident as pruning progresses. For this reason, some techniques, instead of completely removing the weights, adopt a masking approach that is able to preserve and restore the value of the weights if, in a later iteration, they start to become relevant. 75 | 76 | ### Element Type: Neuron Vs Connection 77 | This simply refers to the type of element being pruned, which can be either a single neuron or a connection between two neurons. 78 | 79 | ### Timing: Dynamic Vs Static 80 | Dynamic pruning is performed at runtime. It introduces some overhead but can be performed adaptively, computation by computation. Static pruning is performed before deploying the model. 81 | 82 | 83 | 84 | ## Pruning Techniques 85 | 86 | The techniques explored in the literature can be grouped into clusters: 87 | - Magnitude-based 88 | - Filter and feature map selection 89 | - Optimization-based 90 | - Sensitivity-based 91 | 92 | ```mermaid 93 | flowchart LR 94 | A["Pruning \n Techniques"] --> B["Magnitude \n Based"] 95 | A --> C["Filter/ \n Feature Map \n Selection"] 96 | C --> CA["Variance Selection"] 97 | C --> CB["Entropy Based"] 98 | C --> CC["Zeros Counting \n (APoZ)"] 99 | C --> CD["Filter Weight \n Summing"] 100 | C --> CE["Geometric Median \n Distance"] 101 | A --> D["Optimization \n Based"] 102 | D --> DA["Greedy Search"] 103 | D --> DB["LASSO"] 104 | A --> E["Sensitivity \n Based"] 105 | E --> EA["First Order \n Taylor Expansion"] 106 | E --> EB["Hessian \n Approximation"] 107 | click CB "https://arxiv.org/abs/1706.05791" 108 | ``` 109 | 110 | ### MAGNITUDE BASED: Simple, Regularized
111 | Under the hypothesis that smaller weights have a minor impact on the model accuracy, the weights that are smaller than a given threshold are pruned. To encourage more weights to become prunable, some regularization can be applied: usually the $L_1$ norm works better right after pruning, while $L_2$ works better if the weights of the pruned network are fine-tuned. 112 | 113 | ### VARIANCE SELECTION 114 | #### Inbound Pruning [[1]](#1) 115 | The inbound pruning method targets the number of channels on which each filter operates. The amount of information each channel brings is measured by the variance of the activation output of the specific channel: 116 | 117 | $$ \sigma_{ts} = var(|| W_{ts} * X_{s} ||_F) $$ 118 | 119 | Where $t$ is the filter and $s$ is the activation channel. 120 | Pruning of the entire network is performed sequentially over the layers of the network: from lower to higher layers, pruning is followed by fine-tuning, which is followed by pruning of the next layer. The speed-up of the inbound pruning scheme comes directly from reducing the amount of computation that each filter performs. 121 | 122 | #### Reduce and Reuse [[1]](#1) 123 | To find candidates for pruning, the variance of each filter output is calculated on a sample of the training set. Then, all filters whose score is below the $\mu$-th percentile are eliminated. 124 | 125 | $$ \sigma_{t} = var(|| W_{t} * X ||_F) $$ 126 | 127 | The pruning performed requires adaptation, since the next layer expects to receive inputs of the size prior to pruning and with similar activation patterns. Therefore, the remaining output channels are reused to reconstruct each pruned channel through linear combinations: 128 | 129 | $$ A = \min_{A} \left(\sum_i || Y_i - A Y'_i ||_2^2 \right)$$ 130 | 131 | Where $Y_i$ is the output before pruning, $Y'_i$ is the output after pruning, and $i$ iterates over the input samples, each input sample producing different activations. 132 | The operation of $A$ is added to the network by introducing a convolution layer with filters of size 1 × 1 that contains the elements of $A$. 133 | 134 | #### Entropy Based Pruning [[2]](#2) 135 | An entropy-based metric is used to evaluate the weakness of each channel: a larger entropy value means the channel carries more information. First, global average pooling is used to convert the output of layer $i$, which is a $c \times h \times w$ tensor, into a $1 \times c$ vector. In order to calculate the entropy, more output values need to be collected, which can be obtained using an evaluation set. 136 | Finally, we get a matrix $M \in R^{ n \times c}$, where $n$ is the number of images in the evaluation set and $c$ is the number of channels. For each channel $j$, we look at the distribution of $M[:,j]$. To compute the entropy value of this channel, we first divide it into $m$ different bins and calculate the probability of each bin. 137 | The entropy is then computed as 138 | 139 | $$ H_j = - \sum_i^m p_i \log(p_i) $$ 140 | 141 | Where $p_i$ is the probability of bin $i$ and $H_j$ is the entropy of channel $j$. Whenever one layer is pruned, the whole network is fine-tuned for one or two epochs to partially recover its performance. Only after the final layer has been pruned is the network fine-tuned carefully for many epochs. 142 | 143 | #### APoZ [[3]](#3) 144 | The Average Percentage of Zeros (APoZ) is defined to measure the percentage of zero activations of a neuron after the ReLU mapping.
$APoZ_c^{(i)}$ of the $c-th$ neuron in the $i-th$ layer is defined as: 145 | 146 | $$ APoZ_c^{(i)} = \frac{\sum_{k} \sum_{j} f(O_{c,j}^{(i)}(k))}{N \times M}$$ 147 | 148 | Where $O_c^{(i)}$ denotes the output of the $c-th$ neuron in the $i-th$ layer, $M$ denotes the dimension of the output feature map of $O_c^{(i)}$, $N$ denotes the total number of validation examples, and $f(\cdot)$ is a function that is equal to one if $O_c^{(i)} = 0$ and zero otherwise. 149 | A higher mean APoZ indicates more redundancy in a layer. Since a neural network has a multiplication-addition-activation computation process, a neuron whose outputs are mostly zeros will contribute very little to the output of subsequent layers, as well as to the final results. Thus, we can remove those neurons without harming the overall accuracy of the network too much. 150 | Empirically, it was found that by starting to trim from a few layers with a high mean APoZ, and then progressively trimming neighboring layers, it is possible to rapidly reduce the number of neurons while maintaining the performance of the original network. Pruning the neurons whose APoZ is larger than one standard deviation from the mean APoZ of the target trimming layer produces good retraining results. 151 | 152 | #### Filter Weight Summing [[4]](#4) 153 | The relative importance of a filter in each layer is measured by calculating the sum of its absolute weights: 154 | 155 | $$ s_j = \sum |F_{i,j}|$$ 156 | 157 | This value gives an expectation of the magnitude of the output feature map. The filters are then sorted by $s_j$, and the $m$ smallest are pruned together with the kernels in the next layer corresponding to the pruned feature maps. After a layer is pruned, the network is fine-tuned, and pruning is continued layer by layer. 158 | 159 | #### Geometric Median [[5]](#5) 160 | The geometric median is used to capture the common information of all the filters within the $i-th$ layer: 161 | 162 | $$ x^{GM} = argmin_x \sum_{j'} || x - F_{i,j'} ||_2$$ 163 | $$ x \in R ^ {N_i \times K \times K} $$ 164 | $$ j'\in [1, N_{i+1}] $$ 165 | 166 | Where $N_i$ and $N_{i+1}$ represent the number of input channels and output channels of the $i-th$ convolution layer, respectively. $F_{i,j}$ represents the $j-th$ filter of the $i-th$ layer; the dimension of filter $F_{i,j}$ is therefore $N_i \times K \times K$, where $K$ is the kernel size of the network. 167 | 168 | The filters closest to the geometric median are then found and pruned, since they can be represented by the other filters. 169 | 170 | Computing the geometric median exactly is quite complex, but it is sufficient to find the filter that minimizes the sum of the distances to the other filters: 171 | 172 | $$ F_{i,x^*} = argmin_x \sum_{j'} || x - F_{i,j'} ||_2$$ 173 | $$ x \in [F_{i,1} \dots F_{i,N_{i+1}}] $$ 174 | $$ j'\in [1, N_{i+1}] $$ 175 | 176 | The selected filter(s) $F_{i, x^* }$ and the remaining ones share the most common information, which indicates that the information of the filter(s) $F_{i, x^* }$ can be replaced by that of the others. 177 | 178 | ### OPTIMIZATION BASED [[6]](#6) 179 | 180 | #### Greedy search pruning 181 | 182 | This method “starts from an empty network and greedily adds important neurons from the large network” of a pre-trained model.
With this technique: 183 | - you should obtain better accuracy than with the opposite approach (starting from the full network and removing neurons) 184 | 185 | - you can also prune randomly weighted networks 186 | 187 | Research on the topic suggests not re-training the new sub-network, but just fine-tuning it “to leverage the information from the large model”. [[7]](#7) 188 | 189 | #### OBD [[8]](#8) 190 | 191 | This technique, dating back to 1989, has since been developed into newer and better-performing methodologies, but we report it because it was a pillar of the pruning literature. The Optimal Brain Damage technique uses “second-derivative information to make a tradeoff between network complexity and training set error”, which in this case means “selectively deleting weights” based on “two terms: the ordinary training error” and “some measure of the network complexity”. 192 | 193 | #### Lasso [[9]](#9) 194 | 195 | The Least Absolute Shrinkage and Selection Operator (LASSO), and the extensions derived from it, perform pruning at a higher level than other techniques: it acts at the level of weights, neurons and filters, creating greater sparsity at the level of weights. 196 | 197 | In this technique, “the loss is constantly increased until some degree of sparsity is observed”; “there is no exact method for obtaining a satisfactory value”, which also depends on the size of the dataset. 198 | 199 | 200 | 201 | ### SENSITIVITY BASED [[10]](#10) 202 | 203 | Large networks might contain many connections that contribute only a small part of the overall accuracy. In addition, redundant connections can compromise the generalization capability that a deep network should have, or its training efficiency, especially on edge devices. By selectively disabling connections between neurons, we can reduce the number of connections and the amount of operations needed for network training. Furthermore, the extracted connection mask can be reused for training the network from scratch, with only a small accuracy loss. [[11]](#11) 204 | 205 | The sensitivity measure is the probability of output inversions due to input variations with respect to the overall input patterns. This kind of pruning can be repeated iteratively until no more connections can be removed under a given performance requirement. 206 | 207 | #### Taylor expansion - first order (lighter than the following second order one) [[12]](#12) 208 | 209 | The weights and their gradients are used for ranking filters, which requires far less memory than using feature maps. The element-wise product of weights and gradients gathers exactly the first-degree Taylor polynomial of the corresponding feature map w.r.t. the loss through back-propagation, meaning it approximates the actual damage caused by removing the filter. 210 | 211 | #### Hessian approximation (HAP) - second order [[13]](#13) 212 | 213 | The HAP method is a structured pruning method that cuts channels based on their second-order sensitivity, which measures the flatness/sharpness of the loss landscape. Channels are sorted based on this metric, and only insensitive channels are pruned. 214 | 215 | ## References 216 | 217 | [1] 218 | Polyak, Adam and Wolf, Lior, ["Channel-level acceleration of deep face representations"](https://ieeexplore.ieee.org/document/7303876), in IEEE Access (2015). 219 | 220 | [2] 221 | Luo, Jian-Hao, and Jianxin Wu, ["An Entropy-based Pruning Method for CNN Compression"](https://arxiv.org/abs/1706.05791), in arXiv preprint arXiv:1706.05791 (2017). 222 | 223 | 224 | [3] 225 | Hu, H., Peng, R., Tai, Y. W., & Tang, C.
K., ["Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures"](https://arxiv.org/abs/1607.03250) in arXiv preprint arXiv:1607.03250 (2016). 226 | 227 | [4] 228 | Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. ["Pruning filters for efficient convnets"](https://arxiv.org/abs/1608.08710). arXiv preprint arXiv:1608.08710 (2016). 229 | 230 | [5] 231 | He, Y., Liu, P., Wang, Z., Hu, Z., & Yang, Y. . ["Filter pruning via geometric median for deep convolutional neural networks acceleration"](https://arxiv.org/abs/1811.00250). In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019). 232 | 233 | [6] 234 | Ekbom, L. .["Exploring the LASSO as a Pruning Method"](https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8993061&fileOId=8993068). (2019). 235 | 236 | [7] 237 | Ye, M., Lemeng, W., & Qiang, L. . ["Greedy optimization provably wins the lottery: Logarithmic number of winning tickets is enough."](https://par.nsf.gov/servlets/purl/10276253). Advances in Neural Information Processing Systems 33 (2020) 238 | 239 | [8] 240 | LeCun, Y., Denver, J., & Solla, S.. ["Optimal brain damage."](https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html). Advances in neural information processing systems 2 (1989). 241 | 242 | [9] 243 | Ekbom, L. . ["Exploring the LASSO as a Pruning Method."](https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8993061&fileOId=8993068). (2019). 244 | 245 | [10] 246 | Dajie Cooper, Y. . ["Sensitivity Based Network Pruning: A Modern Perspective."](http://users.cecs.anu.edu.au/~Tom.Gedeon/conf/ABCs2018/paper/ABCs2018_paper_135.pdf). 247 | 248 | [11] 249 | Xiaoqin, Z. et al. . ["A sensitivity-based approach for pruning architecture of Madalines."](https://www.researchgate.net/publication/220372395_A_sensitivity-based_approach_for_pruning_architecture_of_Madalines). 250 | Neural Computing and Applications 18.8 (2009). 251 | 252 | [12] 253 | Mengqing, W. . ["Pruning Convolutional Filters with First Order Taylor Series Ranking."](http://users.cecs.anu.edu.au/~Tom.Gedeon/conf/ABCs2018/paper/ABCs2018_paper_53.pdf). 254 | 255 | [13] 256 | Shixing, Y. et al. . ["Hessian-aware pruning and optimal neural implant."](https://openaccess.thecvf.com/content/WACV2022/papers/Yu_Hessian-Aware_Pruning_and_Optimal_Neural_Implant_WACV_2022_paper.pdf). Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. (2022). 257 | 258 | 259 | 260 | 261 | 262 |
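As a complement to the APoZ and sensitivity-based sections above, here is a small, self-contained sketch of how the APoZ statistic can be estimated with forward hooks in PyTorch (the model, the layer choice and the random "validation data" are placeholders invented purely for illustration):

```python
# Sketch: estimate APoZ (Average Percentage of Zeros) per channel after a ReLU,
# following the formula in the APoZ section above. Channels with a high APoZ
# are candidates for pruning.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

zero_counts, total_counts = {}, {}

def make_hook(name):
    def hook(module, inputs, output):
        # output: (batch, channels, H, W); count zero activations per channel
        zeros = (output == 0).sum(dim=(0, 2, 3))
        zero_counts[name] = zero_counts.get(name, 0) + zeros
        total_counts[name] = total_counts.get(name, 0) + output.numel() // output.shape[1]
    return hook

for idx, module in enumerate(model):
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(f"relu_{idx}"))

# A placeholder "validation set" of random images, just to exercise the hooks.
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(8, 3, 32, 32))

for name in zero_counts:
    apoz = zero_counts[name].float() / total_counts[name]
    print(name, "mean APoZ:", apoz.mean().item())
```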

263 | Join the community • 264 | Contribute to the library 265 |

266 | -------------------------------------------------------------------------------- /pruning-resources.md: -------------------------------------------------------------------------------- 1 | 2 |

3 | Join the community • 4 | Contribute to the library 5 |

6 | 7 | 8 | 9 | 10 | 11 | # Resources 12 | 13 | Awesome resource collection on pruning techniques: 14 | 15 | - Literature reviews and papers 16 | - Courses, webinars and blogs 17 | 18 | If you are new to pruning, check the [overview page on pruning](https://github.com/nebuly-ai/exploring-AI-optimization/blob/main/pruning-overview.md) or the introduction to AI optimization at this [link](https://github.com/nebuly-ai/exploring-AI-optimization). 19 | 20 | 21 | ## Literature reviews and papers 22 | Legend: 23 | - ✏️ 100-499 citations, ✏️✏️ $\geq$ 500 citations 24 | - ⭐ 100-249 stars, ⭐⭐ $\geq$ 250 stars 25 | 26 | Sorting: typology / chronological / alphabetical order 27 |
28 | 29 | ### Literature reviews 30 | - A Survey of Model Compression and Acceleration for Deep Neural Networks [[✏️✏️paper](https://arxiv.org/abs/1710.09282)] 31 | - APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Wang_APQ_Joint_Search_for_Network_Architecture_Pruning_and_Quantization_Policy_CVPR_2020_paper.html)][⭐⭐[github](https://github.com/mit-han-lab/apq)] 32 | - Compression of Deep Learning Models for Text: A Survey [[✏️paper](https://arxiv.org/pdf/2008.05221.pdf)] 33 | - Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/1113d7a76ffceca1bb350bfe145467c6-Abstract.html)] 34 | - Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding [[✏️✏️paper](https://arxiv.org/abs/1510.00149)] 35 | - Dynamic Network Surgery for Efficient DNNs [[✏️✏️NIPS](https://proceedings.neurips.cc/paper/2016/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html)] 36 | - Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better Models Smaller, Faster, and Better [[paper](https://arxiv.org/pdf/2106.08962.pdf)][⭐⭐[github](https://github.com/reddragon/efficient-dl-survey-paper)] 37 | - ETH 2021, Sparsity in Deep Learning: Pruning + growth for efficient inference and training in neural networks: [[✏️JLMR](https://www.jmlr.org/papers/volume22/21-0366/21-0366.pdf)][⭐⭐[github](https://github.com/spcl/sparsity-in-deep-learning)] 38 | - Learning both Weights and Connections for Efficient Neural Networks [[✏️✏️NIPS](https://proceedings.neurips.cc/paper/2015/hash/ae0eb3eed39d2bcef4622b2499a05fe6-Abstract.html)] 39 | - Learning Efficient Convolutional Networks Through Network Slimming [[✏️✏️ICCV](https://openaccess.thecvf.com/content_iccv_2017/html/Liu_Learning_Efficient_Convolutional_ICCV_2017_paper.html)][⭐⭐[github 1.](https://github.com/Eric-mingjie/network-slimming)[github 1.](https://github.com/Eric-mingjie/network-slimming)] 40 | - Learning Filter Basis for Convolutional Neural Network Compression [[✏️ICCV](https://openaccess.thecvf.com/content_ICCV_2019/html/Li_Learning_Filter_Basis_for_Convolutional_Neural_Network_Compression_ICCV_2019_paper.html)] 41 | - Pruning Filters for Efficient ConvNets [[✏️✏️paper](https://arxiv.org/abs/1608.08710)] 42 | - Pruning from Scratch [[✏️AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/6910)] 43 | - Recent Advances in Efficient Computation of Deep Convolutional Neural Networks [[✏️paper](https://scholar.google.com/scholar?hl=es&as_sdt=0%2C5&q=Recent+Advances+in+Efficient+Computation+of+Deep+Convolutional+Neural+Networks&btnG=#:~:text=ACNP%20Full%20Text-,Recent%20advances%20in%20efficient%20computation%20of%20deep%20convolutional%20neural%20networks,-J%20Cheng%2C)] 44 | - Rethinking the Value of Network Pruning [[✏️✏️paper](https://arxiv.org/abs/1810.05270)][⭐⭐[github](https://github.com/Eric-mingjie/rethinking-network-pruning)] 45 | - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks [[✏️✏️paper](https://arxiv.org/abs/1803.03635)] 46 | - To prune, or not to prune: exploring the efficacy of pruning for model compression [[✏️✏️paper](https://arxiv.org/abs/1710.01878)] 47 |
48 | 49 | ### Papers 50 | 2021 51 | - A Probabilistic Approach to Neural Network Pruning [[PMLR](http://proceedings.mlr.press/v139/qian21a.html)] 52 | - Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework [[PMLR](http://proceedings.mlr.press/v139/wang21e.html?ref=https://githubhelp.com)] 53 | - ChipNet: Budget-Aware Pruning with Heaviside Continuous Approximations [[paper](https://arxiv.org/abs/2102.07156)][⭐⭐[github](https://github.com/transmuteAI/ChipNet)] 54 | - Content-Aware GAN Compression [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Liu_Content-Aware_GAN_Compression_CVPR_2021_paper.html)][⭐⭐[github](https://github.com/lychenyoko/content-aware-gan-compression)] 55 | - Convolutional Neural Network Pruning with Structural Redundancy Reduction [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Wang_Convolutional_Neural_Network_Pruning_With_Structural_Redundancy_Reduction_CVPR_2021_paper.html)] 56 | - Dynamic Network Surgery for Efficient DNNs [[✏️✏️NIPS](https://proceedings.neurips.cc/paper/2016/hash/2823f4797102ce1a1aec05359cc16dd9-Abstract.html)][⭐⭐[github](https://github.com/yiwenguo/Dynamic-Network-Surgery)] 57 | - Group Fisher Pruning for Practical Network Compression [[PMLR](http://proceedings.mlr.press/v139/liu21ab.html?ref=https://githubhelp.com)] 58 | - Joint-DetNAS: Upgrade Your Detector with NAS, Pruning and Dynamic Distillation [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Yao_Joint-DetNAS_Upgrade_Your_Detector_With_NAS_Pruning_and_Dynamic_Distillation_CVPR_2021_paper.html)] 59 | - Manifold Regularized Dynamic Network Pruning [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Tang_Manifold_Regularized_Dynamic_Network_Pruning_CVPR_2021_paper.html)] 60 | - Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network [[paper](https://arxiv.org/abs/2103.09377)][⭐⭐[github](https://github.com/chrundle/biprop)] 61 | - Network Pruning That Matters: A Case Study on Retraining Variants [[paper](https://arxiv.org/abs/2105.03193)][⭐⭐[github](https://github.com/lehduong/NPTM)] 62 | - Network Pruning via Performance Maximization [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Gao_Network_Pruning_via_Performance_Maximization_CVPR_2021_paper.html)] 63 | - NPAS: A Compiler-aware Framework of Unified Network Pruning andArchitecture Search for Beyond Real-Time Mobile Acceleration [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Li_NPAS_A_Compiler-Aware_Framework_of_Unified_Network_Pruning_and_Architecture_CVPR_2021_paper.html)] 64 | - On the Predictability of Pruning Across Scales [[PMLR](http://proceedings.mlr.press/v139/rosenfeld21a.html)] 65 | - Prune Once for All: Sparse Pre-Trained Language Models [[paper](https://arxiv.org/abs/2111.05754)][⭐⭐[github](https://github.com/intellabs/model-compression-research-package)] 66 | - Pruning Neural Networks at Initialization: Why Are We Missing the Mark? 
[[✏️paper](https://arxiv.org/abs/2009.08576)] 67 | - Towards Compact CNNs via Collaborative Compression [[CVPR](https://arxiv.org/abs/2105.11228)] 68 | 69 | 70 | 2020 71 | 72 | - A Gradient Flow Framework For Analyzing Network Pruning [[paper](https://arxiv.org/abs/2009.11839)][⭐⭐[github](https://github.com/EkdeepSLubana/flowandprune)] 73 | - A Signal Propagation Perspective for Pruning Neural Networks at Initialization [[✏️paper](https://arxiv.org/abs/1906.06307)][⭐⭐[github](https://github.com/namhoonlee/spp-public)] 74 | - Accelerating CNN Training by Pruning Activation Gradients [[ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58595-2_20)] 75 | - Adversarial Neural Pruning with Latent Vulnerability Suppression [[PMLR](http://proceedings.mlr.press/v119/madaan20a.html)][⭐⭐[github](https://github.com/divyam3897/ANP_VS)] 76 | - AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates [[✏️AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/5924)] 77 | - Bayesian Bits: Unifying Quantization and Pruning [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/3f13cf4ddf6fc50c0d39a1d5aeb57dd8-Abstract.html)] 78 | - Channel Pruning via Automatic Structure Search [[✏️paper](https://arxiv.org/abs/2001.08565)][⭐⭐[github](https://github.com/lmbxmu/ABCPruner)] 79 | - Comparing Rewinding and Fine-tuning in Neural Network Pruning [[✏️paper](https://arxiv.org/abs/2003.02389)][⭐⭐[github](https://github.com/lottery-ticket/rewinding-iclr20-public)] 80 | - DA-NAS: Data Adapted Pruning for Efficient Neural Architecture Search [[ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58583-9_35)] 81 | - DHP: Differentiable Meta Pruning via HyperNetworks [[✏️ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58598-3_36)][⭐⭐[github](https://github.com/ofsoundof/dhp)] 82 | - Differentiable Joint Pruning and Quantization for Hardware Efficiency [[ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58526-6_16)] 83 | - Directional Pruning of Deep Neural Networks [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/a09e75c5c86a7bf6582d2b4d75aad615-Abstract.html)][⭐⭐[github](https://github.com/donlan2710/gRDA-Optimizer/tree/master/directional_pruning)] 84 | - Discrete Model Compression With Resource Constraint for Deep Neural Networks [[CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Gao_Discrete_Model_Compression_With_Resource_Constraint_for_Deep_Neural_Networks_CVPR_2020_paper.html)] 85 | - DMCP: Differentiable Markov Channel Pruning for Neural Networks [[✏️CVPR](https://arxiv.org/abs/2005.03354)][⭐⭐[github](https://arxiv.org/abs/2005.03354)] 86 | - DropNet: Reducing Neural Network Complexity via Iterative Pruning [[PMLR](http://proceedings.mlr.press/v119/tan20a.html)][⭐⭐[github](https://github.com/tanchongmin/DropNet)] 87 | - DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation [[ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58580-8_35)] 88 | - Dynamic Model Pruning with Feedback [[✏️paper](https://arxiv.org/abs/2006.07253)] 89 | - EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning [[✏️ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58536-5_38)][⭐⭐[github](https://github.com/anonymous47823493/EagleEye)] 90 | - Fast Convex Pruning of Deep Neural Networks [[paper](https://arxiv.org/pdf/1806.06457.pdf)] 91 | - Few Sample Knowledge Distillation for Efficient Network Compression 
[[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Li_Few_Sample_Knowledge_Distillation_for_Efficient_Network_Compression_CVPR_2020_paper.html)] 92 | - Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection [[✏️PMLR](https://proceedings.mlr.press/v119/ye20b.html)][⭐⭐[github](https://github.com/lushleaf/Network-Pruning-Greedy-Forward-Selection)] 93 | - Group Sparsity: The Hinge Between Filter Pruning and Decomposition for Network Compression [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Li_Group_Sparsity_The_Hinge_Between_Filter_Pruning_and_Decomposition_for_CVPR_2020_paper.html)][⭐⭐[github](https://github.com/ofsoundof/group_sparsity)] 94 | - HRank: Filter Pruning using High-Rank Feature Map [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Lin_HRank_Filter_Pruning_Using_High-Rank_Feature_Map_CVPR_2020_paper.html)][⭐⭐[github](https://github.com/lmbxmu/HRank)] 95 | - HYDRA: Pruning Adversarially Robust Neural Networks [[✏️NIPS](https://proceedings.neurips.cc/paper/2020/hash/e3a72c791a69f87b05ea7742e04430ed-Abstract.html)] 96 | - Layer-adaptive Sparsity for the Magnitude-based Pruning [[paper](https://arxiv.org/abs/2010.07611)][⭐⭐[github](https://github.com/jaeho-lee/layer-adaptive-sparsity)] 97 | - Learning Filter Pruning Criteria for Deep Convolutional Neural Networks Acceleration [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/He_Learning_Filter_Pruning_Criteria_for_Deep_Convolutional_Neural_Networks_Acceleration_CVPR_2020_paper.html)] 98 | - Logarithmic Pruning is All You Need [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/1e9491470749d5b0e361ce4f0b24d037-Abstract.html)] 99 | - Lookahead: A Far-sighted Alternative of Magnitude-based Pruning [[paper](https://arxiv.org/abs/2002.04809)][⭐⭐[github](https://github.com/alinlab/lookahead_pruning)] 100 | - Meta-Learning with Network Pruning [[ECCV](https://arxiv.org/abs/2007.03219)] 101 | - Movement Pruning: Adaptive Sparsity by Fine-Tuning [[✏️NIPS](https://proceedings.neurips.cc/paper/2020/hash/eae15aabaa768ae4a5993a8a4f4fa6e4-Abstract.html)] 102 | - Multi-Dimensional Pruning: A Unified Framework for Model Compression [[CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Guo_Multi-Dimensional_Pruning_A_Unified_Framework_for_Model_Compression_CVPR_2020_paper.html)] 103 | - Neural Network Pruning with Residual-Connections and Limited-Data [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Luo_Neural_Network_Pruning_With_Residual-Connections_and_Limited-Data_CVPR_2020_paper.html)] 104 | - Neural Pruning via Growing Regularization [[paper](https://arxiv.org/abs/2012.09243)][⭐⭐[github](https://github.com/mingsun-tse/regularization-pruning)] 105 | - Neuron Merging: Compensating for Pruned Neurons [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/0678ca2eae02d542cc931e81b74de122-Abstract.html)][⭐⭐[github](https://github.com/friendshipkim/neuron-merging)] 106 | - Neuron-level Structured Pruning using Polarization Regularizer [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/703957b6dd9e3a7980e040bee50ded65-Abstract.html)] 107 | - Operation-Aware Soft Channel Pruning using Differentiable Masks [[✏️PMLR](https://proceedings.mlr.press/v119/kang20a.html)] 108 | - Position-based Scaled Gradient for Model Quantization and Pruning [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/eb1e78328c46506b46a4ac4a1e378b91-Abstract.html)] 109 | - ProxSGD: Training Structured Neural Networks under Regularization and Constraints 
[[ICLR](https://orbilu.uni.lu/bitstream/10993/45122/1/proxsgd_training_structured_neural_networks_under_regularization_and_constraints.pdf)]
110 | - Pruning Filter in Filter [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/ccb1d45fb76f7c5a0bf619f979c6cf36-Abstract.html)]
111 | - Pruning neural networks without any data by iteratively conserving synaptic flow [[✏️NIPS](https://proceedings.neurips.cc/paper/2020/hash/46a4378f835dc8040c8057beb6a2da52-Abstract.html)]
112 | - Reborn filters: Pruning convolutional neural networks with limited data [[AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/6058)]
113 | - Robust Pruning at Initialization [[paper](https://arxiv.org/abs/2002.08797)]
114 | - Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/eae27d77ca20db309e056e3d2dcd7d69-Abstract.html)]
115 | - SCOP: Scientific Control for Reliable Neural Network Pruning [[✏️NIPS](https://proceedings.neurips.cc/paper/2020/hash/7bcdf75ad237b8e02e301f4091fb6bc8-Abstract.html)]
116 | - Soft Threshold Weight Reparameterization for Learnable Sparsity [[✏️PMLR](https://proceedings.mlr.press/v119/kusupati20a.html)][⭐⭐[github](https://github.com/RAIVNLab/STR)]
117 | - Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/a914ecef9c12ffdb9bede64bb703d877-Abstract.html)]
118 | - Structured Compression by Weight Encryption for Unstructured Pruning and Quantization [[CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Kwon_Structured_Compression_by_Weight_Encryption_for_Unstructured_Pruning_and_Quantization_CVPR_2020_paper.html)]
119 | - The Generalization-Stability Tradeoff In Neural Network Pruning [[NIPS](https://proceedings.neurips.cc/paper/2020/hash/ef2ee09ea9551de88bc11fd7eeea93b0-Abstract.html)]
120 | - Towards Efficient Model Compression via Learned Global Ranking [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Chin_Towards_Efficient_Model_Compression_via_Learned_Global_Ranking_CVPR_2020_paper.html)][⭐⭐[github](https://github.com/enyac-group/LeGR)]
121 |
122 |
123 | 2019
124 | - Accelerate CNN via Recursive Bayesian Pruning [[ICCV](https://openaccess.thecvf.com/content_ICCV_2019/html/Zhou_Accelerate_CNN_via_Recursive_Bayesian_Pruning_ICCV_2019_paper.html)]
125 | - Adversarial Robustness vs Model Compression, or Both?
[[✏️ICCV](https://openaccess.thecvf.com/content_ICCV_2019/html/Ye_Adversarial_Robustness_vs._Model_Compression_or_Both_ICCV_2019_paper.html)]
126 | - Approximated Oracle Filter Pruning for Destructive CNN Width Optimization [[✏️PMLR](http://proceedings.mlr.press/v97/ding19a.html)][⭐⭐[github](https://github.com/DingXiaoH/AOFP)]
127 | - AutoPrune: Automatic Network Pruning by Regularizing Auxiliary Parameters [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/4efc9e02abdab6b6166251918570a307-Abstract.html)]
128 | - Centripetal SGD for Pruning Very Deep Convolutional Networks with Complicated Structure [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Ding_Centripetal_SGD_for_Pruning_Very_Deep_Convolutional_Networks_With_Complicated_CVPR_2019_paper.html)]
129 | - Collaborative Channel Pruning for Deep Networks [[✏️PMLR](http://proceedings.mlr.press/v97/peng19c.html?ref=https://githubhelp.com)]
130 | - COP: Customized Deep Model Compression via Regularized Correlation-Based Filter-Level Pruning [[paper](https://arxiv.org/abs/1906.10337)][⭐⭐[github](https://github.com/ZJULearning/COP)]
131 | - DARB: A Density-Aware Regular-Block Pruning for Deep Neural Networks [[paper](https://arxiv.org/abs/1911.08020)]
132 | - Data-Independent Neural Pruning via Coresets [[✏️paper](https://arxiv.org/abs/1907.04018)]
133 | - EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis [[✏️ICML](https://proceedings.mlr.press/v97/wang19g.html)][⭐⭐[github](https://github.com/alecwangcq/EigenDamage-Pytorch)]
134 | - Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration [[✏️✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/He_Filter_Pruning_via_Geometric_Median_for_Deep_Convolutional_Neural_Networks_CVPR_2019_paper.html)]
135 | - Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/b51a15f382ac914391a58850ab343b00-Abstract.html)]
136 | - Global Sparse Momentum SGD for Pruning Very Deep Neural Networks [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/f34185c4ca5d58e781d4f14173d41e5d-Abstract.html)]
137 | - Importance Estimation for Neural Network Pruning [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Molchanov_Importance_Estimation_for_Neural_Network_Pruning_CVPR_2019_paper.html)]
138 | - MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning [[✏️ICCV](https://openaccess.thecvf.com/content_ICCV_2019/html/Liu_MetaPruning_Meta_Learning_for_Automatic_Neural_Network_Channel_Pruning_ICCV_2019_paper.html)]
139 | - Model Compression with Adversarial Robustness: A Unified Optimization Framework [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/2ca65f58e35d9ad45bf7f3ae5cfd08f1-Abstract.html)]
140 | - Network Pruning via Transformable Architecture Search [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/a01a0380ca3c61428c26a231f0e49a09-Abstract.html)][⭐⭐[github](https://github.com/D-X-Y/NAS-Projects)]
141 | - OICSR: Out-In-Channel Sparsity Regularization for Compact Deep Neural Networks [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Li_OICSR_Out-In-Channel_Sparsity_Regularization_for_Compact_Deep_Neural_Networks_CVPR_2019_paper.html)]
142 | - On Implicit Filter Level Sparsity in Convolutional Neural Networks [[CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Mehta_On_Implicit_Filter_Level_Sparsity_in_Convolutional_Neural_Networks_CVPR_2019_paper.html)]
143 | - One
ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers [[✏️NIPS](https://proceedings.neurips.cc/paper/2019/hash/a4613e8d72a61b3b69b32d040f89ad81-Abstract.html)] 144 | - One-Shot Pruning of Recurrent Neural Networks by Jacobian Spectrum Evaluation [[paper](https://arxiv.org/abs/1912.00120)] 145 | - Partial Order Pruning: for Best Speed/Accuracy Trade-off in Neural Architecture Search [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Li_Partial_Order_Pruning_For_Best_SpeedAccuracy_Trade-Off_in_Neural_Architecture_CVPR_2019_paper.html)] 146 | - Provable Filter Pruning for Efficient Neural Networks [[✏️paper](https://arxiv.org/abs/1911.07412)][⭐⭐[github](https://github.com/lucaslie/provable_pruning)] 147 | - Structured Pruning of Neural Networks with Budget-Aware Regularization [[✏️CVPR](https://arxiv.org/abs/1811.09332)] 148 | - The State of Sparsity in Deep Neural Networks [[✏️paper](https://arxiv.org/pdf/1902.09574.pdf)][⭐⭐[github](https://github.com/bayesgroup/variational-dropout-sparsifies-dnn)] 149 | - Towards Optimal Structured CNN Pruning via Generative Adversarial Learning [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Lin_Towards_Optimal_Structured_CNN_Pruning_via_Generative_Adversarial_Learning_CVPR_2019_paper.html)] 150 | - Variational Convolutional Neural Network Pruning [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Zhao_Variational_Convolutional_Neural_Network_Pruning_CVPR_2019_paper.html)] 151 | 152 | 153 | 154 | 2018 155 | 156 | - A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers [[✏️ECCV](https://openaccess.thecvf.com/content_ECCV_2018/html/Tianyun_Zhang_A_Systematic_DNN_ECCV_2018_paper.html)][⭐⭐[github](https://github.com/KaiqiZhang/admm-pruning)] 157 | - Accelerating Convolutional Networks via Global & Dynamic Filter Pruning [[✏️NIPS](https://www.ijcai.org/proceedings/2018/0336.pdf)] 158 | - Amc: Automl for model compression and acceleration on mobile devices [[✏️✏️ECCV](https://openaccess.thecvf.com/content_ECCV_2018/html/Yihui_He_AMC_Automated_Model_ECCV_2018_paper.html)][⭐⭐[github](https://github.com/mit-han-lab/amc)] 159 | - CLIP-Q: Deep Network Compression Learning by In-Parallel Pruning-Quantization [[✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2018/html/Tung_CLIP-Q_Deep_Network_CVPR_2018_paper.html)] 160 | - Constraint-Aware Deep Neural Network Compression [[✏️ECCV](https://openaccess.thecvf.com/content_ECCV_2018/html/Changan_Chen_Constraints_Matter_in_ECCV_2018_paper.html)] 161 | - Coreset-Based Neural Network Compression [[✏️ECCV](https://openaccess.thecvf.com/content_ECCV_2018/html/Abhimanyu_Dubey_Coreset-Based_Convolutional_Neural_ECCV_2018_paper.html)] 162 | - Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds [[✏️paper](https://arxiv.org/abs/1804.05345)] 163 | - Data-Driven Sparse Structure Selection for Deep Neural Networks [[✏️ECCV](https://arxiv.org/abs/1707.01213v3)][⭐⭐[github](https://github.com/huangzehao/sparse-structure-selection)] 164 | - Discrimination-aware Channel Pruning for Deep Neural Networks [[✏️NIPS](https://proceedings.neurips.cc/paper/2018/hash/55a7cf9c71f1c9c495413f934dd1a158-Abstract.html)] 165 | - Dynamic Channel Pruning: Feature Boosting and Suppression [[✏️paper](https://arxiv.org/abs/1810.05331)][⭐⭐[github](https://github.com/deep-fry/mayo)] 166 | - Dynamic Sparse Graph for Efficient Deep Learning [[paper](https://arxiv.org/abs/1810.00859)] 167 | - 
Frequency-Domain Dynamic Pruning for Convolutional Neural Networks [[✏️NIPS](https://proceedings.neurips.cc/paper/2018/hash/a9a6653e48976138166de32772b1bf40-Abstract.html)] 168 | - “Learning-Compression” Algorithms for Neural Net Pruning [[✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2018/html/Carreira-Perpinan_Learning-Compression_Algorithms_for_CVPR_2018_paper.html)] 169 | - Learning Sparse Neural Networks via Sensitivity-Driven Regularization [[✏️NIPS](https://proceedings.neurips.cc/paper/2018/hash/04df4d434d481c5bb723be1b6df1ee65-Abstract.html)] 170 | - Model Compression and Acceleration for Deep Neural Networks [[✏️IEEE](https://sci-hub.se/10.1109/MSP.2017.2765695)]" 171 | - NISP: Pruning Networks using Neuron Importance Score Propagation [[✏️✏️CVPR](https://arxiv.org/abs/1711.05908)] 172 | - PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning [[✏️✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2018/html/Mallya_PackNet_Adding_Multiple_CVPR_2018_paper.html)][⭐⭐[github](https://github.com/arunmallya/packnet)] 173 | - Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers [[✏️paper](https://arxiv.org/abs/1802.00124)][⭐⭐[github](https://github.com/bobye/batchnorm_prune)] 174 | - SNIP: Single-shot Network Pruning based on Connection Sensitivity [[✏️✏️paper](https://arxiv.org/abs/1810.02340)][⭐⭐[github](https://github.com/namhoonlee/snip-public)] 175 | - Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks [[✏️✏️paper](https://arxiv.org/abs/1808.06866)][⭐⭐[github](https://github.com/he-y/soft-filter-pruning)] 176 | - Stronger generalization bounds for deep nets via a compression approach [[✏️PMLR](http://proceedings.mlr.press/v80/arora18b/arora18b.pdf)] 177 | 178 | 179 | 2017 180 | 181 | - Channel pruning for accelerating very deep neural networks [[✏️✏️ICCV](http://openaccess.thecvf.com/content_iccv_2017/html/He_Channel_Pruning_for_ICCV_2017_paper.html)][⭐⭐[github](https://github.com/yihui-he/channel-pruning)] 182 | - Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning [[✏️✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2017/html/Yang_Designing_Energy-Efficient_Convolutional_CVPR_2017_paper.html)] 183 | - Efficient Processing of Deep Neural Networks: A Tutorial and Survey [[✏️✏️IEEE](https://sci-hub.se/10.1109/JPROC.2017.2761740)] 184 | - Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon [[✏️NIPS](https://proceedings.neurips.cc/paper/2017/hash/c5dc3e08849bec07e33ca353de62ea04-Abstract.html)] 185 | - Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee [[✏️NIPS](https://proceedings.neurips.cc/paper/2017/hash/3fab5890d8113d0b5a4178201dc842ad-Abstract.html)] 186 | - Runtime Neural Pruning [[✏️NIPS](https://proceedings.neurips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html)] 187 | 188 | 189 | 190 | 2016 191 | 192 | - Pruning Convolutional Neural Networks for Resource Efficient Inference [[✏️✏️paper](https://arxiv.org/abs/1611.06440)] 193 | 194 | 195 | 1989 196 | 197 | - Optimal Brain Damage (LeCun) [[✏️✏️NIPS](https://proceedings.neurips.cc/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf)] 198 |
199 |
200 |
201 | ## Courses, webinars and blogs
202 |
203 | Courses
204 | - Stanford 2020, Hardware Accelerators for Machine Learning [[paper](https://cs217.stanford.edu)]
205 |
206 | Webinars, video content
207 | - Cornell 2022, ML HW & Systems [[paper](https://www.youtube.com/watch?v=66HCPAcimGY&list=PL0mFAhrXqy9CuopJhAB8GVu_Oy7J0ery6)]
208 | - MIT 2021, Vivienne Sze & Lex Fridman, Efficient Computing for Deep Learning, Robotics, and AI [[paper](https://www.youtube.com/watch?v=WbLQqPw_n88)]
209 | - Stanford 2017, Song Han, Efficient Methods and Hardware for Deep Learning [[paper](https://www.youtube.com/watch?v=eZdOkDtYMoo)]
210 |
211 |
212 | Blogs, written content
213 | - Adding Quantization-aware Training and Pruning to the TensorFlow Model Garden [[paper](https://blog.tensorflow.org/2022/06/Adding-Quantization-aware-Training-and-Pruning-to-the-TensorFlow-Model-Garden.html)]
214 | - A friendly introduction to machine learning compilers and optimizers [[paper](https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html)]
215 | - Faster Deep Learning Training with PyTorch – a 2021 Guide [[paper](https://efficientdl.com/faster-deep-learning-in-pytorch-a-guide/)]
216 | - Optimize machine learning models with TensorFlow [[paper](https://www.tensorflow.org/model_optimization)]
217 |
218 |
219 |
220 |
221 |
222 |
223 |
224 |

225 | Join the community • 226 | Contribute to the library 227 |

228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | -------------------------------------------------------------------------------- /quantization-overview.md: -------------------------------------------------------------------------------- 1 |

2 | Join the community • 3 | Contribute to the library 4 |

5 |
6 |
7 |
8 |
9 |
10 | # Overview
11 |
12 | Here we will explore
13 | - Fundamentals
14 | - Overview of quantization techniques
15 | - Papers, blogs and other resources
16 |
17 | ## What is neural network quantization?
18 |
19 | Over the years, the evolution of neural networks has progressed towards larger models. Although the accuracy of these models has increased significantly, such large models often cannot be deployed in applications that require real-time inference or low power consumption, or that are limited in computing capacity or budget.
20 |
21 | To overcome this challenge, various types of optimization have been introduced, quantization being one of them. The intuition behind quantization is to create a more compact representation of the neural network by reducing the precision of weights, biases and/or activations below the standard floating-point precision. This technique reduces the memory required to store tensors and the data movement between memory and processors, and enables faster computation without sacrificing too much accuracy.
22 |
23 | ### Key takeaways
24 |
25 | - Quantization is the process of reducing the precision of neural network weights, biases and/or activations to cut the cost of storing, transporting and computing data during neural network inference and training.
26 | - Quantization yields lower latency / higher throughput, reduced model size, memory footprint and power consumption, at the cost of a small deterioration in model accuracy.
27 | - Quantization is primarily applied during inference, as training requires higher precision. It is also possible to perform quantization-aware training to directly produce a quantized model at the end of training.
28 | - Typically, quantization means mapping model parameters from 32-bit floating point to 16 or 8 bits, and it is also possible to go down to 1-bit quantization.
29 | - Quantization requires hardware or software that can effectively support lower precision model parameters. Many frameworks such as PyTorch and TensorFlow support quantization natively, and libraries like nebullvm let you leverage quantization on any framework and hardware.
30 |
31 | ## More resources
32 |
33 | See the [resource page on quantization](https://github.com/nebuly-ai/learning-AI-optimization/blob/main/quantization-resources) for the best literature reviews, papers, courses, blogs and open-source libraries on this topic.
34 |
35 | ## Quantization techniques
36 |
37 | Quantization techniques can be clustered by
38 |
39 | - [1. Type](#types)
40 | - [2. Granularity](#granularity)
41 | - [3. Tuning method](#fine-tuning-methods)
42 | - [4. Computational strategy](#computation-strategy)
43 |
44 | ```mermaid
45 | flowchart LR
46 | A[Quantization \n Inference \n Techniques] --> B[Type]
47 | B --> BA[Uniform]
48 | B --> BB[Non-Uniform]
49 | BA --> BAA[Symmetric]
50 | BA --> BAB[Asymmetric]
51 | BA --> BAC[Static]
52 | BA --> BAD[Dynamic]
53 | B --> BD[Stochastic]
54 | B --> BE[Extreme]
55 | A --> C[Granularity]
56 | C --> CA[Calibration]
57 | CA --> CAA[Layer]
58 | CA --> CAB[Group]
59 | CA --> CAC[Channel]
60 | CA --> CAD[Sub-Channel]
61 | C --> CB[Mixed \n Precision]
62 | A --> D[Tuning \n Methods]
63 | D --> DA[Fine-Tuning]
64 | DA --> DAA[QAT]
65 | DA --> DAB[PTQ]
66 | D --> DB[Tuning \n Strategy]
67 | DB --> DC[Quantized \n Gradients]
68 | DB --> DD[Distillation \n Assisted]
69 | DB --> DE[Hardware \n Aware]
70 | A --> E[Computation \n Strategy]
71 | E --> EA[Simulated]
72 | E --> EB[Integer \n Only]
73 | E --> EC[Dyadic]
74 | ```
75 |
76 | ## 1. Types
77 |
78 | ### 1.1 Uniform
79 | General quantization can be expressed as:
80 |
81 | $$Q(r)=Int(r/S) - Z$$
82 |
83 | Where:
84 | * $r$ is the floating-point value to be quantized.
85 | * $S$ is the scaling factor that maps the floating-point value into the appropriate range for quantization.
86 | * $Z$ is an integer offset that sets the zero-point of the quantized representation.
87 |
88 | The scheme is called uniform because the spacing between two consecutive quantized values is fixed and determined by $S$.
89 |
90 | The scaling factor is computed as:
91 |
92 | $$S=\frac{\beta - \alpha}{2^b - 1}$$
93 |
94 | Where:
95 |
96 | * $\beta$ is the upper bound of the range to be represented, typically $r_{max}$.
97 | * $\alpha$ is the lower bound of the range to be represented, typically $r_{min}$.
98 | * $b$ is the bit-width of the quantized value.
99 |
100 | The process of selecting $\alpha$ and $\beta$ is called **calibration**.
101 |
102 | ### Symmetric
103 | In symmetric quantization:
104 | $$|\beta| = |\alpha| = \max(|r_{max}|,|r_{min}|)$$
105 |
106 | This has the advantage of eliminating the need to compute $Z$, since the mapping is already centered at $0$. However, some of the resolution is wasted on representing values that never occur, since typically $|r_{max}| \neq |r_{min}|$.
107 |
108 | ### Asymmetric
109 | In asymmetric quantization the simplest choice is:
110 |
111 | $$\beta = r_{max}$$
112 | $$\alpha = r_{min}$$
113 |
114 | However, more sophisticated strategies can be used to improve the resolution at the price of losing some representation capability for the extreme values. Typical choices are:
115 | * picking a percentile of the full range.
116 | * minimizing the KL divergence between the original and the quantized distribution.
117 |
118 | The parameter $Z$ must be computed, but the accuracy of the representation is improved. This strategy is particularly effective when the ranges are strongly asymmetric, as in the case of ReLU, and $Z$ can be incorporated into the activation bias.
119 |
120 | ### Static and Dynamic
121 | Once the weights have been quantized their values do not change, so weight quantization is always static. The quantization of activation maps, however, depends on the inputs, and their range is not known a priori. Several strategies can be adopted in this case:
122 |
123 | * *Dynamic Quantization:* the ranges are computed at run-time, which introduces overhead but yields high accuracy.
124 | * *Static Quantization:* the ranges are pre-computed, for example by running a series of input samples to determine reasonable ranges,
125 | or by calibrating the quantization during training.
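
To make the uniform quantization formulas above concrete, here is a minimal NumPy sketch of asymmetric uniform quantization with the conventions used in this section (the function names and the 8-bit unsigned target are illustrative assumptions, not the API of any particular library):

```python
import numpy as np

def calibrate(x: np.ndarray, num_bits: int = 8):
    """Pick the range [alpha, beta] and derive the scale S and zero-point Z."""
    alpha, beta = float(x.min()), float(x.max())
    S = max(beta - alpha, 1e-8) / (2 ** num_bits - 1)  # S = (beta - alpha) / (2^b - 1)
    Z = int(round(alpha / S))                          # with this convention, alpha maps to the quantized value 0
    return S, Z

def quantize(x: np.ndarray, S: float, Z: int, num_bits: int = 8) -> np.ndarray:
    q = np.round(x / S) - Z                            # Q(r) = Int(r / S) - Z
    return np.clip(q, 0, 2 ** num_bits - 1).astype(np.uint8)

def dequantize(q: np.ndarray, S: float, Z: int) -> np.ndarray:
    return S * (q.astype(np.float32) + Z)              # r ≈ S * (Q(r) + Z)

x = np.random.randn(1000).astype(np.float32)           # pretend these are weights or activations
S, Z = calibrate(x)                                    # asymmetric calibration: alpha = min(x), beta = max(x)
x_hat = dequantize(quantize(x, S, Z), S, Z)
print(np.abs(x - x_hat).max())                         # reconstruction error is on the order of S
```

For the symmetric variant one would instead set $|\beta| = |\alpha| = \max(|r_{min}|, |r_{max}|)$, drop $Z$, and clip to a signed range such as $[-127, 127]$.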
126 |
127 | ### 1.2 Non-Uniform
128 | Instead of using a linear scale, more complex strategies can be used:
129 |
130 | * Logarithmic scale.
131 | * Code-based: the quantization is expressed as a real linear combination of binary vectors.
132 | * Computing ad-hoc scales by minimizing an optimization problem that is jointly trained with the model.
133 |
134 | ### 1.3 Stochastic
135 | Small updates to the weights may not result in any change to the quantized model, since the rounding operation may keep returning the same quantized value. Different approaches can be used to avoid this situation, such as letting the rounding operation round up or down with a probability that depends on the value being quantized. The disadvantage is that generating random numbers and evaluating this probability distribution introduces some overhead.
136 |
137 | ### 1.4 Extreme
138 | Extreme quantization occurs when the quantized values are restricted to binary $\{-1,1\}$ or ternary $\{-1,0,1\}$ levels, in combination with binary or ternary activation maps; usually ReLU and other activations can also be transformed into a $sign$ function. The increase in overall performance can be considerable if the hardware is well optimized for these operations; in practice any multiplication can be transformed into a simple logical XNOR. Usually this type of approach also requires the use of mixed precision to avoid losing too much accuracy. Ternary quantization retains the zero representation, which can be used to avoid unnecessary calculations or to perform pruning and further optimization.
139 |
140 | ## 2.0 Granularity
141 |
142 | ### 2.1 Calibration Granularity
143 | The granularity used to compute the ranges $[\alpha,\beta]$ can differ:
144 | * **Layer**
145 | * **Group:** a subset of channels.
146 | * **Channel**
147 | * **Sub-Channel:** any group of parameters.
148 |
149 | The de facto standard is per-channel granularity.
150 |
151 | ### 2.2 Mixed Precision
152 | Each layer is quantized with a different bit precision; this is usually done when binary or ternary representations are used for quantization. The selection of the precision for each layer is essentially a search problem. Typical approaches to selecting the number of bits for each layer are:
153 |
154 | * Reinforcement learning.
155 | * Second-order sensitivity (Hessian-based) methods, which measure how sensitive a layer is to quantization.
156 | * Linear programming approaches.
157 |
158 |
159 | ## 3.1 Fine-Tuning Methods
160 |
161 | ### PTQ - Post Training Quantization
162 | PTQ consists of training in floating point and quantizing the model only at the end. The drop in accuracy is usually nontrivial, and several methods are used to reduce the impact of quantization:
163 |
164 | * Equalizing the weight ranges (and implicitly the activation ranges) between different layers or channels.
165 | * Optimizing the L2 distance between the quantized tensor and the corresponding floating-point tensor.
166 | * Adapting the rounding strategy instead of simply rounding to the nearest value.
167 |
168 | ### QAT - Quantization Aware Training
169 | The main idea is to train the model again with quantized parameters to improve accuracy. The parameters are quantized after each gradient update. To recover accuracy, it may be necessary to perform this retraining for several hundred epochs, especially in the case of low bit-precision quantization. However, QAT produces higher accuracy than PTQ at the expense of higher overhead.
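
To make the QAT recipe above concrete, the snippet below is a minimal PyTorch-style sketch of the fake-quantization step applied to the weights at every forward pass. The function name, the symmetric per-tensor 8-bit scheme and the toy training step are illustrative assumptions, not code from any specific library; the snippet also uses the straight-through estimator discussed in the next paragraph.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric, per-tensor uniform quantization during training.

    The forward pass returns de-quantized values on the original scale; the
    backward pass lets gradients flow through unchanged (straight-through
    estimator), because round() has zero gradient almost everywhere.
    """
    qmax = 2 ** (num_bits - 1) - 1                # e.g. 127 for 8 bits
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses q, backward sees the identity.
    return x + (q - x).detach()

# Toy training step: quantize the weights on the fly, then compute the loss as usual.
w = torch.randn(64, 64, requires_grad=True)
x = torch.randn(32, 64)
y = x @ fake_quantize(w).t()
loss = y.pow(2).mean()
loss.backward()                                   # w.grad is well defined thanks to the STE
```

In a real setup the same operator would typically be applied to activations as well, and the scale could be tracked with running statistics or learned during training.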
170 |
171 | The quantization operator is not differentiable and is piecewise flat, so its gradient is zero almost everywhere. The so-called Straight-Through Estimator (STE) is used to solve this problem.
172 | The main idea is to approximate the gradient of the quantization operator with the identity function (in practice this is often implemented with a `detach`-style trick, as in the sketch above). Other STE and non-STE approaches have also been proposed.
173 |
174 | ## 3.2 Tuning Strategy
175 |
176 | ### Hardware-Aware Quantization
177 | The benefits of quantization are hardware-dependent, so the recent trend is to identify the type of quantization that can be performed based on the specifications of the hardware on which the model is deployed. A typical approach is to use an agent that learns which quantization steps can be performed.
178 |
179 | ### Distillation Assisted
180 | The main idea is to incorporate model distillation to boost quantization accuracy.
181 |
182 | ### Improved Loss Function
183 | The idea is to minimize the task loss directly with respect to the quantized weights, so that the optimization looks for weights that work well with the loss function rather than weights that merely approximate the floating-point values as accurately as possible.
184 |
185 | ### Quantized Gradient
186 | To improve training time, one proposed approach is to quantize the gradients as well.
187 |
188 | ## 4.0 Computation Strategy
189 |
190 | ### 4.1 Simulated
191 | In simulated quantization, quantized model parameters are stored in low precision, but operations (matrix multiplications and convolutions) are performed with floating-point arithmetic. Therefore, the quantized parameters must be dequantized before the floating-point operations.
192 |
193 | ### 4.2 Integer-Only
194 | In integer-only quantization, all operations are performed using low-precision integer arithmetic. Low-precision logic has many advantages over its full-precision counterpart in terms of latency, power consumption and area efficiency.
195 |
196 | ### 4.3 Dyadic
197 | All scaling operations are performed with dyadic numbers, that is, rational numbers with an integer numerator and a power of 2 as the denominator. This means that scaling can be performed simply by applying bit-shift operations, which are far more efficient than divisions. All additions must be carried out at the same dyadic scale, which makes the addition logic simpler and more efficient.
198 |
199 |
200 | # Interesting material on quantization
201 |
202 | ## HAWQV3: Dyadic Neural Network Quantization
203 |
204 | [This Berkeley paper](https://arxiv.org/abs/2011.10680) proposes a mixed-precision, integer-only quantization approach based on dyadic arithmetic, combining symmetric and asymmetric quantization. It suggests an interesting strategy for selecting the quantization scheme of different layers using hardware-dependent metrics. The library is available at [HAWQ GitHub](https://github.com/Zhen-Dong/HAWQ/blob/main/ILP.ipynb).
205 |
206 | The paper opens up many questions, and below are just a few of them.
207 |
208 | * What are the limitations of the approach proposed in the paper? Is there room for improvement?
209 | * Are there better schemes for performing quantization?
210 | * Is it possible to implement this quantization approach in [nebullvm](https://github.com/nebuly-ai/nebullvm)?
211 |
212 | Let us know your views on this topic, either by opening an issue or dropping a message on the community channel.
213 |
214 |
215 |
216 |
217 |
218 |
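
To make the dyadic rescaling of Section 4.3, used by integer-only pipelines such as HAWQ-V3, concrete, here is a small illustrative sketch. The function names, the 16-bit shift and the example scale are assumptions made for the example, not code from the paper or its library.

```python
def to_dyadic(multiplier: float, shift_bits: int = 16):
    """Approximate a positive float multiplier as b / 2**c (a dyadic number)."""
    b = round(multiplier * (1 << shift_bits))
    return b, shift_bits

def requantize(acc: int, b: int, c: int) -> int:
    """Rescale an integer accumulator using only integer multiply and bit-shift.

    Adding half of the divisor before shifting implements round-to-nearest.
    """
    return (acc * b + (1 << (c - 1))) >> c

# Example: a combined int8 layer scale S_x * S_w / S_y of roughly 0.0072.
b, c = to_dyadic(0.0072)
acc = 12345                      # int32 accumulator produced by an int8 matmul
print(requantize(acc, b, c))     # ≈ round(12345 * 0.0072) = 89
```

The only operations involved are an integer multiplication, an addition and a bit-shift, which is why dyadic scaling maps so well onto integer-only hardware.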

219 | Join the community • 220 | Contribute to the library 221 |

222 | -------------------------------------------------------------------------------- /quantization-resources.md: -------------------------------------------------------------------------------- 1 | 2 |

3 | Join the community • 4 | Contribute to the library 5 |

6 |
7 |
8 |
9 |
10 |
11 | # Resources
12 |
13 | Awesome resource collection on quantization techniques.
14 |
15 | - Literature reviews and papers
16 | - Courses, webinars and blogs
17 |
18 | If you haven't already, check the [overview page on quantization](https://github.com/nebuly-ai/learning-AI-optimization/blob/main/quantization-overview.md) or [Exploring AI optimization](https://github.com/nebuly-ai/exploring-AI-optimization) to start getting into the topic.
19 |
20 |
21 |
22 | ## Literature reviews and papers
23 | Legend:
24 | - ✏️ 100-499 citations, ✏️✏️ $\geq$ 500 citations
25 | - ⭐ 100-249 stars, ⭐⭐ $\geq$ 250 stars
26 |
27 | Sorting: chronological/alphabetic order
28 |
29 | 30 | ### Literature reviews 31 | 32 | - A Comprehensive Survey on Model Quantization for Deep Neural Networks [[paper](https://arxiv.org/abs/2205.07877)] 33 | - A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off [[NeurIPS](https://proceedings.neurips.cc/paper/2019/hash/38ef4b66cb25e92abe4d594acb841471-Abstract.html)] 34 | - A Survey of Quantization Methods for Efficient Neural Network Inference [[✏️paper](https://arxiv.org/abs/2103.13630)] 35 | - Analysis of Quantized Models [[ICLR](https://openreview.net/forum?id=ryM_IoAqYX)] 36 | - Binary Neural Networks: A Survey [[✏️paper](https://www.sciencedirect.com/science/article/abs/pii/S0031320320300856)] 37 | - Compression of Deep Learning Models for Text: A Survey [[✏️paper](https://arxiv.org/pdf/2008.05221.pdf)] 38 | 39 | 40 | ### Papers 41 | 2022 42 | - 8-bit Optimizers via Block-wise Quantization [[ICLR](https://arxiv.org/abs/2110.02861)] 43 | - F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization [[ICLR](https://arxiv.org/abs/2202.05239)] 44 | - FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer [[IJCAI](https://arxiv.org/abs/2111.13824)][[github](https://github.com/megvii-research/FQ-ViT)] 45 | - Learnable Lookup Table for Neural Network Quantization [[CVPR](https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Learnable_Lookup_Table_for_Neural_Network_Quantization_CVPR_2022_paper.html)] 46 | - MultiQuant: Training Once for Multi-bit Quantization of Neural Networks [[paper](https://www.ijcai.org/proceedings/2022/0504.pdf)] 47 | - Q-ViT: Fully Differentiable Quantization for Vision Transformer [[paper](https://arxiv.org/abs/2201.07703)] 48 | - QDrop: Randomly Dropping Quantization for Extremely Low-bit Post-Training Quantization [[paper](https://arxiv.org/abs/2203.05740)] 49 | - RAPQ: Rescuing Accuracy for Power-of-Two Low-bit Post-training Quantization [[IJCAI](https://arxiv.org/abs/2204.12322)] 50 | - SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation [[ICLR](https://arxiv.org/abs/2202.07471)] 51 | - Toward Efficient Low-Precision Training: Data Format Optimization and Hysteresis Quantization [[ICLR](https://openreview.net/forum?id=3HJOA-1hb0e)] 52 | 53 | 54 | 2021 55 | - A Winning Hand: Compressing Deep Networks Can Improve Out-of-Distribution Robustness [[NeurIPS](https://proceedings.neurips.cc/paper/2021/hash/0607f4c705595b911a4f3e7a127b44e0-Abstract.html)] 56 | - Any-Precision Deep Neural Networks [[paper](https://ojs.aaai.org/index.php/AAAI/article/view/17286)] 57 | - Binary Graph Neural Networks [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Bahri_Binary_Graph_Neural_Networks_CVPR_2021_paper.html)] 58 | - BiPointNet: Binary Neural Network for Point Clouds [[paper](https://arxiv.org/abs/2010.05501)] 59 | - BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction [[✏️paper](https://arxiv.org/abs/2102.05426)] 60 | - BSQ: Exploring Bit-Level Sparsity for Mixed-Precision Neural Network Quantization [[ICLR](https://arxiv.org/abs/2102.10462)] 61 | - Degree-Quant: Quantization-Aware Training for Graph Neural Networks [[ICLR](https://arxiv.org/abs/2008.05000)] 62 | - Diversifying Sample Generation for Accurate Data-Free Quantization [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Zhang_Diversifying_Sample_Generation_for_Accurate_Data-Free_Quantization_CVPR_2021_paper.html)] 63 | - FracBits: Mixed Precision Quantization via Fractional Bit-Widths 
[[paper](https://ojs.aaai.org/index.php/AAAI/article/view/17269)] 64 | - High-Capacity Expert Binary Networks [[ICLR](https://arxiv.org/abs/2010.03558)] 65 | - Incremental few-shot learning via vector quantization in deep embedded space [[ICLR](https://openreview.net/forum?id=3SV-ZePhnZM)] 66 | - Learnable Companding Quantization for Accurate Low-bit Neural Networks [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Yamamoto_Learnable_Companding_Quantization_for_Accurate_Low-Bit_Neural_Networks_CVPR_2021_paper.html)] 67 | - MQBench: Towards Reproducible and Deployable Model Quantization Benchmark [[paper](https://arxiv.org/abs/2111.03759)][[github](http://mqbench.tech/)] 68 | - Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network [[paper](https://arxiv.org/abs/2103.09377)] 69 | - Network Quantization with Element-wise Gradient Scaling [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Lee_Network_Quantization_With_Element-Wise_Gradient_Scaling_CVPR_2021_paper.html)] 70 | - OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization [[AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/16950)] 71 | - Post-Training Quantization for Vision Transformer [[NeurIPS](https://proceedings.neurips.cc/paper/2021/hash/ec8956637a99787bd197eacd77acce5e-Abstract.html)] 72 | - Post-­training Quantization with Multiple Points: Mixed Precision without Mixed Precision [[paper](https://ojs.aaai.org/index.php/AAAI/article/view/17054)] 73 | - Post-Training Sparsity-Aware Quantization [[NeurIPS](https://proceedings.neurips.cc/paper/2021/hash/9431c87f273e507e6040fcb07dcb4509-Abstract.html)] 74 | - ReCU: Reviving the Dead Weights in Binary Neural Networks [[ICCV](https://openaccess.thecvf.com/content/ICCV2021/html/Xu_ReCU_Reviving_the_Dead_Weights_in_Binary_Neural_Networks_ICCV_2021_paper.html)] 75 | - Stochastic Precision Ensemble: Self‐Knowledge Distillation for Quantized Deep Neural Networks [[paper](https://ojs.aaai.org/index.php/AAAI/article/view/16839)] 76 | - Training with Quantization Noise for Extreme Model Compression [[✏️paper](https://arxiv.org/abs/2004.07320)][[github](https://github.com/facebookresearch/fairseq/tree/main/examples/quant_noise)] 77 | - TRQ: Ternary Neural Networks with Residual Quantization [[paper](https://ojs.aaai.org/index.php/AAAI/article/view/17036)] 78 | - Zero-shot Adversarial Quantization [[CVPR](https://openaccess.thecvf.com/content/CVPR2021/html/Liu_Zero-Shot_Adversarial_Quantization_CVPR_2021_paper.html)] 79 | 80 | 81 | 2020 82 | - APQ: Joint Search for Network Architecture, Pruning and Quantization Policy [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Wang_APQ_Joint_Search_for_Network_Architecture_Pruning_and_Quantization_Policy_CVPR_2020_paper.html)][⭐[github](https://github.com/mit-han-lab/apq)] 83 | - AutoQ: Automated Kernel-Wise Neural Network Quantization [[paper](https://arxiv.org/abs/1902.05690)] 84 | - Balanced Binary Neural Networks with Gated Residual [[IEEE](https://ieeexplore.ieee.org/abstract/document/9054599)] 85 | - BiDet: An Efficient Binarized Object Detector [[CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Wang_BiDet_An_Efficient_Binarized_Object_Detector_CVPR_2020_paper.html)][⭐[github](https://github.com/ZiweiWangTHU/BiDet)] 86 | - BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations [[paper](https://arxiv.org/abs/2002.06517)] 87 | - Differentiable Joint Pruning and Quantization for 
Hardware Efficiency [[ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58526-6_16)] 88 | - Forward and Backward Information Retention for Accurate Binary Neural Networks [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Qin_Forward_and_Backward_Information_Retention_for_Accurate_Binary_Neural_Networks_CVPR_2020_paper.html)][⭐[github](https://github.com/htqin/IR-Net)] 89 | - GhostNet: More Features from Cheap Operations [[✏️✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Han_GhostNet_More_Features_From_Cheap_Operations_CVPR_2020_paper.html)][⭐⭐[github](https://github.com/huawei-noah/Efficient-AI-Backbones)] 90 | - Gradient ℓ1 Regularization for Quantization Robustness [[ICLR](https://arxiv.org/abs/2002.07520)] 91 | - Learned Step Size Quantization [[✏️ICLR](https://arxiv.org/abs/1902.08153)] 92 | - Linear Symmetric Quantization of Neural Networks for Low-precision Integer Hardware [[ICML](https://openreview.net/pdf?id=H1lBj2VFPS)] 93 | - MeliusNet: Can Binary Neural Networks Achieve MobileNet-level Accuracy? [[✏️paper](https://arxiv.org/abs/2001.05936)][⭐[github](https://github.com/hpi-xnor/BMXNet-v2)] 94 | - Mixed Precision DNNs: All you need is a good parametrization [[✏️ICLR](https://arxiv.org/abs/1905.11452)][⭐[github](https://github.com/sony/ai-research-code)] 95 | - PROFIT: A Novel Training Method for sub-4-bit MobileNet Models [[ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58539-6_26)] 96 | - Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT [[✏️AAAI](https://ojs.aaai.org/index.php/AAAI/article/view/6409)] 97 | - ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions [[✏️ECCV](https://link.springer.com/chapter/10.1007/978-3-030-58568-6_9)] 98 | - Riptide: Fast End-to-End Binarized Neural Networks [[MLSys](https://proceedings.mlsys.org/paper/2020/hash/2a79ea27c279e471f4d180b08d62b00a-Abstract.html)][⭐[github](https://github.com/jwfromm/Riptide)] 99 | - Rotation Consistent Margin Loss for Efficient Low-Bit Face Recognition [[CVPR](https://openaccess.thecvf.com/content_CVPR_2020/html/Wu_Rotation_Consistent_Margin_Loss_for_Efficient_Low-Bit_Face_Recognition_CVPR_2020_paper.html)] 100 | - Soft Weight-Sharing for Neural Network Compression [[✏️ICLR](https://arxiv.org/abs/1702.04008)] 101 | 102 | 2019 103 | - Additive Powers-of-Two Quantization: A Non-uniform Discretization for Neural Networks [[✏️paper](https://arxiv.org/abs/1909.13144)][⭐[github](https://github.com/yhhhli/APoT_Quantization)] 104 | - An Empirical study of Binary Neural Networks' Optimisation [[✏️ICLR](https://openreview.net/forum?id=rJfUCoR5KX)][[github](https://github.com/mil-ad/studying-binary-neural-networks)] 105 | - ARM: Augment-REINFORCE-Merge Gradient for Stochastic Binary Networks [[✏️ICLR](https://arxiv.org/abs/1807.11143)] 106 | - Binary Ensemble Neural Network: More Bits per Network or More Networks per Bit? 
[[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Zhu_Binary_Ensemble_Neural_Network_More_Bits_per_Network_or_More_CVPR_2019_paper.html)] 107 | - Circulant Binary Convolutional Networks: Enhancing the Performance of 1-Bit DCNNs With Circulant Back Propagation [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Liu_Circulant_Binary_Convolutional_Networks_Enhancing_the_Performance_of_1-Bit_DCNNs_CVPR_2019_paper.html)] 108 | - Data-Free Quantization through Weight Equalization and Bias Correction [[✏️ICCV](https://openaccess.thecvf.com/content_ICCV_2019/html/Nagel_Data-Free_Quantization_Through_Weight_Equalization_and_Bias_Correction_ICCV_2019_paper.html)] 109 | - Defensive Quantization: When Efficiency Meets Robustness [[✏️paper](https://arxiv.org/abs/1904.08444)] 110 | - Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks [[✏️ICCV](https://openaccess.thecvf.com/content_ICCV_2019/html/Gong_Differentiable_Soft_Quantization_Bridging_Full-Precision_and_Low-Bit_Neural_Networks_ICCV_2019_paper.html)] 111 | - ECC: Platform-Independent Energy-Constrained Deep Neural Network Compression via a Bilinear Regression Model [[CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Yang_ECC_Platform-Independent_Energy-Constrained_Deep_Neural_Network_Compression_via_a_Bilinear_CVPR_2019_paper.html)] 112 | - Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices [[✏️✏️IEEE](https://ieeexplore.ieee.org/abstract/document/8686088)] 113 | - Fully Quantized Network for Object Detection [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Li_Fully_Quantized_Network_for_Object_Detection_CVPR_2019_paper.html)] 114 | - Fully Quantized Transformer for Machine Translation [[paper](V)] 115 | - HAQ: Hardware-Aware Automated Quantization With Mixed Precision [[✏️✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Wang_HAQ_Hardware-Aware_Automated_Quantization_With_Mixed_Precision_CVPR_2019_paper.html)] 116 | - Improving Neural Network Quantization without Retraining using Outlier Channel Splitting [[✏️PMLR](http://proceedings.mlr.press/v97/zhao19c.html)] 117 | - Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization [[✏️NeurIPS](https://proceedings.neurips.cc/paper/2019/hash/9ca8c9b0996bbf05ae7753d34667a6fd-Abstract.html)] 118 | - Learning Channel-Wise Interactions for Binary Convolutional Neural Networks [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Wang_Learning_Channel-Wise_Interactions_for_Binary_Convolutional_Neural_Networks_CVPR_2019_paper.html)] 119 | - Learning to Quantize Deep Networks by Optimizing Quantization Intervals With Task Loss [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Jung_Learning_to_Quantize_Deep_Networks_by_Optimizing_Quantization_Intervals_With_CVPR_2019_paper.html)] 120 | - Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm [[ICLR](https://arxiv.org/abs/1812.11732)] 121 | - Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations [[✏️NeurIPS](https://proceedings.neurips.cc/paper/2019/hash/d202ed5bcfa858c15a9f383c3e386ab2-Abstract.html)] 122 | - Quantization Networks [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Yang_Quantization_Networks_CVPR_2019_paper.html)][⭐[github](https://github.com/aliyun/alibabacloud-quantization-networks)] 123 | - Regularizing Activation Distribution for Training Binarized Deep Networks 
[[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Ding_Regularizing_Activation_Distribution_for_Training_Binarized_Deep_Networks_CVPR_2019_paper.html)] 124 | - Same, Same But Different: Recovering Neural Network Quantization Error Through Weight Factorization [[✏️PMLR](https://proceedings.mlr.press/v97/meller19a.html)] 125 | - SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity Through Low-Bit Quantization [[CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/Cao_SeerNet_Predicting_Convolutional_Neural_Network_Feature-Map_Sparsity_Through_Low-Bit_Quantization_CVPR_2019_paper.html)] 126 | - Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network Using Truncated Gaussian Approximation [[✏️CVPR](https://openaccess.thecvf.com/content_CVPR_2019/html/He_Simultaneously_Optimizing_Weight_and_Quantizer_of_Ternary_Neural_Network_Using_CVPR_2019_paper.html)] 127 | 128 | 2018 129 | - Adaptive Quantization of Neural Networks [[ICLR](https://openreview.net/forum?id=SyOK1Sg0W)] 130 | - Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm [[✏️ECCV](https://openaccess.thecvf.com/content_ECCV_2018/html/zechun_liu_Bi-Real_Net_Enhancing_ECCV_2018_paper.html)][⭐[github](https://github.com/liuzechun/Bi-Real-net)] 131 | - BinaryRelax: A Relaxation Approach For Training Deep Neural Networks With Quantized Weights [[✏️SIAM](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=BinaryRelax%3A+A+Relaxation+Approach+For+Training+Deep+Neural+Networks+With+Quantized+Weights&btnG=)] 132 | - CLIP-Q: Deep Network Compression Learning by In-Parallel Pruning-Quantization [[✏️CVPR](Q_Deep_Network_CVPR_2018_paper)] 133 | - Combinatorial Attacks on Binarized Neural Networks [[paper](https://arxiv.org/abs/1810.03538)] 134 | - DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [[✏️✏️paper](https://arxiv.org/abs/1606.06160)] 135 | - Explicit Loss-Error-Aware Quantization for Low-Bit Deep Neural Networks [[✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2018/html/Zhou_Explicit_Loss-Error-Aware_Quantization_CVPR_2018_paper.html)] 136 | - FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks [[✏️paper](https://dl.acm.org/doi/abs/10.1145/3242897)] 137 | - FP-BNN: Binarized neural network on FPGA [[✏️paper](https://www.sciencedirect.com/science/article/abs/pii/S0925231217315655)] 138 | - Heterogeneous Bitwidth Binarization in Convolutional Neural Networks [[NeurIPS](https://proceedings.neurips.cc/paper/2018/hash/1b36ea1c9b7a1c3ad668b8bb5df7963f-Abstract.html)] 139 | - HitNet: Hybrid Ternary Recurrent Neural Network [[NeurIPS](https://proceedings.neurips.cc/paper/2018/hash/82cec96096d4281b7c95cd7e74623496-Abstract.html)] 140 | - Learning Discrete Weights Using the Local Reparameterization Trick [[✏️ICLR](https://arxiv.org/abs/1710.07739)] 141 | - Loss-aware Weight Quantization of Deep Networks [[✏️paper](https://arxiv.org/abs/1802.08635)] 142 | - LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks [[✏️ECCV](https://openaccess.thecvf.com/content_ECCV_2018/html/Dongqing_Zhang_Optimized_Quantization_for_ECCV_2018_paper.html)] 143 | - Model compression via distillation and quantization [[✏️ICLR](https://arxiv.org/abs/1802.05668)][⭐⭐[github](https://github.com/antspy/quantized_distillation)] 144 | - PACT: Parameterized Clipping Activation for Quantized Neural Networks 
[[✏️✏️paper](https://arxiv.org/abs/1805.06085)] 145 | - ProxQuant: Quantized Neural Networks via Proximal Operators [[✏️paper](https://arxiv.org/abs/1810.00861)] 146 | - Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference [[✏️✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2018/html/Jacob_Quantization_and_Training_CVPR_2018_paper.html)] 147 | - Quantizing deep convolutional networks for efficient inference: A whitepaper [[✏️✏️paper](https://arxiv.org/abs/1806.08342)] 148 | - Relaxed Quantization for Discretized Neural Networks [[✏️paper](https://arxiv.org/abs/1810.01875)] 149 | - Scalable Methods for 8-bit Training of Neural Networks [[✏️NeurIPS](https://proceedings.neurips.cc/paper/2018/hash/e82c4b19b8151ddc25d4d93baf7b908f-Abstract.html)] 150 | - SIGNSGD: compressed optimisation for non-convex problems [[✏️✏️PMLR](https://proceedings.mlr.press/v80/bernstein18a.html)] 151 | 152 | 2017 153 | - Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy [[✏️paper](https://arxiv.org/abs/1711.05852)] 154 | - Binarized Convolutional Landmark Localizers for Human Pose Estimation and Face Alignment with Limited Resources [[✏️ICCV](https://openaccess.thecvf.com/content_iccv_2017/html/Bulat_Binarized_Convolutional_Landmark_ICCV_2017_paper.html)] 155 | - Deep Learning with Low Precision by Half-wave Gaussian Quantization [[✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2017/html/Cai_Deep_Learning_With_CVPR_2017_paper.html)] 156 | - Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM [[✏️paper](https://arxiv.org/abs/1707.09870)] 157 | - FINN: A Framework for Fast, Scalable Binarized Neural Network Inference [[✏️✏️paper](https://dl.acm.org/doi/abs/10.1145/3020078.3021744)] 158 | - Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights [[✏️✏️ICLR](https://arxiv.org/abs/1702.03044)][⭐[github](https://github.com/AojunZhou/Incremental-Network-Quantization)] 159 | - Local Binary Convolutional Neural Networks [[✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2017/html/Juefei-Xu_Local_Binary_Convolutional_CVPR_2017_paper.html)][[github](https://github.com/juefeix/lbcnn.torch)] 160 | - Network Sketching: Exploiting Binary Structure in Deep CNNs [[✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2017/html/Guo_Network_Sketching_Exploiting_CVPR_2017_paper.html)] 161 | - QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding [[✏️✏️NIPS](https://proceedings.neurips.cc/paper/2017/file/6c340f25839e6acdc73414517203f5f0-Paper.pdf)] 162 | - Quantized Neural Networks Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations [[✏️✏️paper](https://www.jmlr.org/papers/volume18/16-456/16-456.pdf)] 163 | 164 | Before 2017 165 | - Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 [[✏️✏️paper](https://arxiv.org/abs/1602.02830)][⭐⭐[github](https://github.com/itayhubara/BinaryNet)] 166 | - Bitwise Neural Networks [[✏️ICML](https://arxiv.org/abs/1601.06071)] 167 | - Loss-aware Binarization of Deep Networks [[✏️paper](https://arxiv.org/abs/1611.01600)] 168 | - Overcoming Challenges in Fixed Point Training of Deep Convolutional Networks [[ICML](https://arxiv.org/abs/1607.02241)] 169 | - Quantized Convolutional Neural Networks for Mobile Devices [[✏️✏️CVPR](https://openaccess.thecvf.com/content_cvpr_2016/html/Wu_Quantized_Convolutional_Neural_CVPR_2016_paper.html)] 170 | - BinaryConnect: 
Training Deep Neural Networks with binary weights during propagations [[✏️✏️NIPS](https://arxiv.org/abs/1511.00363)][⭐⭐[github](https://github.com/MatthieuCourbariaux/BinaryConnect)] 171 | - Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights [[✏️NIPS](https://proceedings.neurips.cc/paper/2014/hash/076a0c97d09cf1a0ec3e19c7f2529f2b-Abstract.html)] 172 | - Back to Simplicity: How to Train Accurate BNNs from Scratch? [[paper](https://arxiv.org/abs/1906.08637)][⭐[github](https://github.com/hpi-xnor/BMXNet-v2)] 173 | 174 | 175 | 176 | 177 | ## Courses, webinars and blogs 178 | 179 | Webinars, video content 180 | - [Codebasics, Quantization in deep learning (Tensorflow, Keras & Python)](https://www.youtube.com/watch?v=v1oHf1KV6kM) 181 | - [Hailo, Quantization of Neural Networks – High Accuracy at Low Precision](https://www.youtube.com/watch?v=DhHyshhc1lY) 182 | - [Sony, Downsizing Neural Networks by Quantization](https://www.youtube.com/watch?v=DDelqfkYCuo) 183 | - [tinyML, A Practical Guide to Neural Network Quantization](https://www.youtube.com/watch?v=KASuxB3XoYQ) 184 | 185 | Blogs, written content 186 | - [How to accelerate and compress neural networks with quantization](https://towardsdatascience.com/how-to-accelerate-and-compress-neural-networks-with-quantization-edfbbabb6af7) 187 | - [Matlab, Quantization of Deep Neural Networks](https://www.mathworks.com/help/deeplearning/ug/quantization-of-deep-neural-networks.html) 188 | - [Quantization for Neural Networks](https://leimao.github.io/article/Neural-Networks-Quantization/) 189 | - [TinyML, Neural Network Quantization](https://www.allaboutcircuits.com/technical-articles/neural-network-quantization-what-is-it-and-how-does-it-relate-to-tiny-machine-learning/#:~:text=What%20is%20Quantization%20for%20Neural,that%20they%20consume%20less%20memory.) 190 | 191 | 192 | 193 | 194 | 195 | 196 |

197 | Join the community • 198 | Contribute to the library 199 |

200 | -------------------------------------------------------------------------------- /weekly-insights.md: -------------------------------------------------------------------------------- 1 |

2 | Subscribe to the newsletter • 3 | Join the community 4 |

5 | 6 | 7 | 8 | 9 | # Weekly insights from top papers on AI 10 | 11 | Welcome to our library of the best insights from the best papers on AI and machine learning. A new paper will be added every Friday.
12 | Don't hesitate to [open an issue](https://github.com/nebuly-ai/exploring-AI-optimization/issues) and submit a paper that you found interesting and the 3 key takeaways. 13 | 14 | 15 | ## Week #9: [Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/pdf/2302.06675.pdf) 16 | 17 | - The paper introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning). 18 | - KOSMOS-1 was trained from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data, and achieved impressive performance on various tasks, such as language understanding, generation, perception-language tasks, and vision tasks. 19 | - MLLMs benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. Additionally, the paper introduces a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs. 20 | 21 | 22 | ## Week #8: [Big Little Transformer Decoder](https://arxiv.org/pdf/2302.06675.pdf) 23 | 24 | - The paper presents a method to reduce the latency of autoregressive transformer models while preserving accuracy. The framework consists of a smaller model that is assisted in generation by a larger model. 25 | - The framework is built around the observation that by correcting only the erroneous tokens of the smaller model using the prediction of the larger model, it is possible to preserve the accuracy of the latter. Based on this observation, a policy is created that decides when the smaller model needs help to generate. 26 | - Only the smaller model is used as an autoregressive model, while the larger one is used to predict the tokens of the whole produced sequence, increasing the arithmetic intensity and thus reducing the overall latency of the generation. 27 | 28 | 29 | ## Week #7: [Symbolic Discovery of Optimization Algorithms](https://arxiv.org/pdf/2302.06675.pdf) 30 | 31 | - The paper presents a method to discover optimization algorithms for deep neural network training by formulating algorithm discovery as program search, leveraging efficient search techniques to explore an infinite and sparse program space. 32 | - The method introduces program selection and simplification strategies to bridge the large generalization gap between proxy and target tasks. 33 | - The discovered algorithm, Lion (EvoLved Sign Momentum), is more memory-efficient than Adam and achieves better performance on a variety of tasks, including image classification, vision-language contrastive learning, diffusion models, autoregressive, masked language modeling, and fine-tuning. 34 | 35 | 36 | ## Week #6: [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/pdf/2302.05543.pdf) 37 | 38 | - The paper presents a neural network architecture called ControlNet that can control large image diffusion models (like Stable Diffusion) to learn task-specific input conditions. 39 | - The ControlNet consists of a "trainable copy" and a "locked copy" of a large diffusion model, connected by a unique type of convolution layer called "zero convolution", which allows for end-to-end learning while preserving the generalization ability of the original model. 
40 | - ControlNet can be trained on small datasets (even <1k samples) and on personal devices, and can still achieve results competitive with models trained on large computation clusters with terabytes of GPU memory and thousands of GPU hours.
41 |
42 |
43 | ## Week #5: [Multimodal Chain-of-Thought Reasoning in Language Models](https://arxiv.org/pdf/2302.00923v2.pdf)
44 |
45 | - The paper introduces Multimodal-CoT, a framework for incorporating vision signals in chain-of-thought (CoT) reasoning with 1B-parameter models. The framework decouples rationale generation and answer inference into two stages, and incorporates vision features to help generate more effective rationales for more accurate answer inference.
46 | - The authors compare Multimodal-CoT with GPT-3.5 on the ScienceQA benchmark and show that their approach surpasses GPT-3.5 by 16% accuracy. They also show that different vision features and backbone models affect the performance of the model, and that DETR (based on object detection) is the best performing vision feature.
47 | - The authors perform a manual error analysis on randomly selected examples generated by Multimodal-CoT. They find that the majority of the errors are due to factual mistakes (failures in understanding maps and counting numbers in the images), commonsense mistakes (answering questions that require commonsense knowledge), and logical mistakes (contradictions in the reasoning chains).
48 |
49 |
50 | ## Week #4: [MusicLM: Generating Music From Text](https://arxiv.org/pdf/2301.11325.pdf)
51 |
52 | - MusicLM is a text-conditioned generative model that can produce high-quality music. The model is trained on a synthetic dataset of audio pairs with matching melodies and different acoustics, as well as data pairs of people humming and singing. The text description is used as a conditioning signal to guide the music generation process.
53 | - The model is able to generate music that follows the target melody contained in the input audio clip, while also being faithful to the text description. MusicLM is capable of generating long, coherent audio sequences that are semantically plausible and consistent with the text description. The model can also be used in "story mode," where the text description changes over time, leading to smooth transitions in the generated music.
54 | - There are several risks associated with MusicLM and its use case, such as the reflection of biases present in the training data and the potential misappropriation of creative content. The authors conducted a thorough study of memorization and found that only a small fraction of examples was memorized exactly.
55 |
56 |
57 | ## Week #3: [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf)
58 |
59 | - The quality of the response produced by a Large Language Model (LLM) is closely related to the prompt used. For example, providing the model with an example or a chain of thought (CoT) as a prompt produces a higher quality response without any training; this type of technique is usually called in-context learning, since the model learns from the provided context instead of updating its parameters.
60 | - The chain-of-thought prompt is particularly useful in arithmetic reasoning, and it has been shown that for symbolic reasoning the chain-of-thought prompt facilitates out-of-distribution generalization to longer sequences.
61 | - A comparative analysis of the usefulness of chain-of-thought reasoning versus model size was conducted, showing that larger models are better reasoners and can benefit more from the chain-of-thought prompt than smaller models.
62 |
63 |
64 | ## Week #2: [Explanations from Large Language Models Make Small Reasoners Better](https://arxiv.org/pdf/2210.06726.pdf)
65 |
66 | - Tuning and inference of LLMs are not trivial in terms of computational cost, so creating smaller models that can be used to solve specific tasks, using LLMs as teachers, can have several advantages.
67 | - In this case, an LLM is used to produce chain-of-thought explanations, which are then validated by comparing the final answer from the LLM with the ground-truth answer y provided by the dataset. A new dataset {x, e, y} of explanations is then created from a smaller dataset {x, y} containing only questions and answers; the new dataset of examples is used to train a smaller T5 3B model to produce the answer along with the chain of thought.
68 | - The results show that the resulting model achieves performance on the Common Sense Question Answering dataset comparable to that of GPT-3 using Zero-Shot-CoT.
69 |
70 |
71 | ## Week #1: [Holistic Evaluation of Language Models (HELM)](https://arxiv.org/pdf/2211.09110.pdf)
72 |
73 | - The Holistic Evaluation of Language Models (HELM) is a toolkit designed to improve the transparency of language models and better understand their capabilities, limitations, and risks.
74 | - HELM uses a multi-metric approach to evaluate language models across a wide range of scenarios and metrics, including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
75 | - HELM conducts a large-scale evaluation of 30 prominent language models across 42 different scenarios, including 21 that have not previously been used in mainstream LM evaluation. The results of the evaluation and all raw model prompts and completions are made publicly available.
76 |
77 |
78 |
79 |
80 |

81 | Subscribe to the newsletter • 82 | Join the community 83 |

84 | --------------------------------------------------------------------------------