├── .gitignore ├── README.md ├── architecture-patterns-python ├── abstraction.jpeg ├── arch_patterns_cover.jpg └── notes.md ├── clean-code ├── clean_code.jpeg └── notes.md ├── examples ├── fp_python.ipynb ├── map_reduce.ipynb ├── network_wikipedia.ipynb ├── parallel_benchmark.ipynb └── parallel_example.py ├── high-performance-python ├── compiler_options.png ├── cover.jpg └── notes.md ├── learning-python-design-patterns ├── cover.jpg └── notes.md └── master-large-data-python ├── cover.png ├── map_reduce.png └── notes.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .ipynb_checkpoints 3 | .vscode -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # My notes on some books I read on Python 2 | 3 | - High performance Python: Practical Performant Programming for Humans - Micha Gorelick, Ian Ozsvald [[open notes]](./high-performance-python/notes.md) 4 | 5 | - Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code - John T. Wolohan [[open notes]](./master-large-data-python/notes.md) 6 | 7 | - Clean Code: A Handbook of Agile Software Craftsmanship - Robert C. Martin [[open notes]](./clean-code/notes.md) 8 | 9 | - Learning Python Design Patterns - Chetan Giridhar [[open notes]](./learning-python-design-patterns/notes.md) 10 | 11 | - Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices - Bob Gregory, Harry Percival [[open notes]](./architecture-patterns-python/notes.md) 12 | 13 | - Fluent Python - Luciano Ramalho 14 | 15 | ![compiler_options](high-performance-python/compiler_options.png) 16 | 17 | ![map_reduce](master-large-data-python/map_reduce.png) -------------------------------------------------------------------------------- /architecture-patterns-python/abstraction.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/architecture-patterns-python/abstraction.jpeg -------------------------------------------------------------------------------- /architecture-patterns-python/arch_patterns_cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/architecture-patterns-python/arch_patterns_cover.jpg -------------------------------------------------------------------------------- /architecture-patterns-python/notes.md: -------------------------------------------------------------------------------- 1 | # Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices 2 | 3 | Authors: Bob Gregory, Harry Percival 4 | 5 | 6 | [Available here](https://www.cosmicpython.com/) 7 | 8 | ![arch_patterns_cover](arch_patterns_cover.jpg) 9 | 10 | - [Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices](#architecture-patterns-with-python-enabling-test-driven-development-domain-driven-design-and-event-driven-microservices) 11 | - [Introduction](#introduction) 12 | - [Encapsulation and Abstractions](#encapsulation-and-abstractions) 13 | - [Layering](#layering) 14 | - [Dependency Inversion Principle (DIP)](#dependency-inversion-principle-dip) 
15 | - [Part I. Building an architecture to support domain modeling](#part-i-building-an-architecture-to-support-domain-modeling) 16 | - [Ch1. Domain Modeling](#ch1-domain-modeling) 17 | - [What is a domain model?](#what-is-a-domain-model) 18 | - [Dataclasses are great for value objects](#dataclasses-are-great-for-value-objects) 19 | - [Value objects and entities](#value-objects-and-entities) 20 | - [Ch2. Repository Pattern](#ch2-repository-pattern) 21 | - [Object-relational mappers (ORMs)](#object-relational-mappers-orms) 22 | - [Introducing the repository pattern](#introducing-the-repository-pattern) 23 | - [What is the trade-off?](#what-is-the-trade-off) 24 | - [Ch3. A brief interlude: on coupling and abstractions](#ch3-a-brief-interlude-on-coupling-and-abstractions) 25 | - [Ch4. Our first use case: flask API and service layer](#ch4-our-first-use-case-flask-api-and-service-layer) 26 | - [A typical service function](#a-typical-service-function) 27 | - [Service Layer vs Domain Service](#service-layer-vs-domain-service) 28 | - [Ch5. TDD in high gear and low gear](#ch5-tdd-in-high-gear-and-low-gear) 29 | - [High and low gear](#high-and-low-gear) 30 | - [Ch6. Unit of Work pattern](#ch6-unit-of-work-pattern) 31 | - [Ch7. Aggregates and consistency boundaries](#ch7-aggregates-and-consistency-boundaries) 32 | - [Why not just run everything in a spreadsheet?](#why-not-just-run-everything-in-a-spreadsheet) 33 | - [Invariants, constraints and consistency](#invariants-constraints-and-consistency) 34 | - [What is an Aggregate?](#what-is-an-aggregate) 35 | - [Choosing an Aggregate](#choosing-an-aggregate) 36 | - [One Aggregate = One Repository](#one-aggregate--one-repository) 37 | - [Optimistic concurrency](#optimistic-concurrency) 38 | - [Part II. Event-Driven Architecture](#part-ii-event-driven-architecture) 39 | 40 | # Introduction 41 | 42 | > A big ball of mud is the natural state of software in the same way that wilderness is the natural state of your garden. It takes energy and direction to prevent the colapse 43 | 44 | ## Encapsulation and Abstractions 45 | Encapsulating behavior by using abstractions is a powerful tool for making code more: 46 | - expressive 47 | - testable 48 | - easier to maintain 49 | 50 | > **Responsibility-driven design**: uses the words *roles* and *responsibilities* rather than *tasks*. Think about code in terms of behavior, rather than in terms of data or algorithms 51 | 52 | ## Layering 53 | - When one function, module, or object uses another, we say that the one *depends on* the other. These dependencies form a kind of network or graph 54 | - **Layered architecture**: divide the code into discrete categories or roles, and introduce rules about which categories of code can call each other 55 | 56 | ## Dependency Inversion Principle (DIP) 57 | Formal definition: 58 | 1. High-level modules should not depend on low-level modules. Both should depend on abstractions 59 | 2. Abstractions should not depend on details. Instead, details should depend on abstractions 60 | 61 | > *Depends on* doesn't mean *imports* or *calls*, necessarily, but rather a more general idea that one module *knows about* or *needs* another module 62 | 63 | # Part I. Building an architecture to support domain modeling 64 | 65 | > "Behavior should come first and drive our storage requirements." 66 | 67 | # Ch1. Domain Modeling 68 | ## What is a domain model? 
69 | - Domain model = business logic layer 70 | - Domain: "The problem you're trying to solve" 71 | - Model: a map of a process or phenomenon that captures a useful property 72 | 73 | > **Domain-driven design (DDD)**: the most important thing about software is that it provides a useful model of a problem. If we get that model right, our software delivers value and makes new things possible 74 | 75 | - Make sure to express rules in the business jargon (*ubiquitous language* in DDD terminology) 76 | - "We could show this code to our nontechnical coworkers, and they would agree that this correctly describes the behavior of the system" 77 | - Avoiding a ball of mud: stick rigidly to principles of encapsulation and careful layering 78 | - `from typing import NewType`: wrap primitive types 79 | 80 | ### Dataclasses are great for value objects 81 | - Business concept that has data but no identity: choose to represent it using the *Value Object* pattern 82 | - *Value Object*: any domain object that is uniquely identified by the data it holds (we usually make them immutable) -> `from dataclasses import dataclass`; `@dataclass(frozen=True)` decorator above class definition 83 | - Dataclasses (or namedtuples) give us *value equality* 84 | 85 | ### Value objects and entities 86 | - Value object: any object that is identified only by its data and doesn't have a long-lived identity 87 | - Entity: domain object that has long-lived identity 88 | - Entities, unlike values, have *identity equality*: we can change their values, and they are still recognizably the same thing -> make this explicit in code by implementing `__eq__` (equality operator) on entities 89 | - `__hash__` is the magic method Python uses to control the behavior of objects when you add them to sets or use them as dict keys 90 | - **Python's magic methods let us use our models with idiomatic Python** 91 | 92 | # Ch2. Repository Pattern 93 | - A simplifying abstraction over data storage -> allow us to decouple our model layer from the data layer 94 | - **Layered architecture**: aim to keep the layers separate, and to have each layer depend only on the one below it 95 | - **Onion architecture**: think of our model as being on the "inside" and dependencies flowing inward to it 96 | - **Dependency inversion principle**: high-level modules (the domain) should not depend on low-level ones (the infrastructure) 97 | 98 | ## Object-relational mappers (ORMs) 99 | - Bridge the conceptual gap between the world of objects and domain modeling and the world of databases and relational algebra 100 | - Gives us *persistence ignorance*: our domain model doesn't need to know anything about how data is loaded or persisted -> keep our domain clean of direct dependencies on particular database technologies 101 | 102 | ## Introducing the repository pattern 103 | - The repository pattern is an abstraction over persistent storage 104 | - It hides the boring details of data access by pretending that all of our data is in memory 105 | 106 | > We often just rely on Python's duck typing to enable abstractions. To a Pythonista, a repository is *any* object that has `add(thing)` and `get(id)` methods 107 | 108 | ## What is the trade-off? 109 | - Write a few lines of code in our repository class each time we add a new domain object that we want to retrieve 110 | - In return we get a simple abstraction over our storage layer, which we control 111 | - Easy to fake out for unit tests 112 | 113 | # Ch3. 
A brief interlude: on coupling and abstractions
114 | - **Coupled components**: when we're unable to change component A for fear of breaking component B
115 | - Local coupling = good = high *cohesion* between the coupled elements
116 | - Global coupling = nuisance = increases the risk/cost of changing our code
117 | 
118 | 
119 | ![abstraction](abstraction.jpeg)
120 | - **Abstraction**: protects us from change by hiding away the complex details of whatever system B does
121 | - We can change the arrows on the right without changing the ones on the left
122 | 
123 | > Try to write a simple implementation and then refactor toward better design
124 | 
125 | # Ch4. Our first use case: flask API and service layer
126 | 
127 | ## A typical service function
128 | 1. Fetch some objects from the repository
129 | 2. Make some checks or assertions about the request against the current state of the world
130 | 3. Call a domain service
131 | 4. If all is well, save/update any state changed
132 | 
133 | - Flask app responsibilities (standard web stuff):
134 | - Per-request session management
135 | - Parsing information out of POST parameters
136 | - Response status codes
137 | - JSON
138 | - All the orchestration logic is in the use case/service layer and the domain logic stays in the domain
139 | 
140 | ## Service Layer vs Domain Service
141 | - **Service Layer (application service)**: handles requests from the outside world and orchestrates an operation
142 | - **Domain Service**: logic that belongs in the domain model, but doesn't sit naturally inside a stateful entity or value object
143 | 
144 | # Ch5. TDD in high gear and low gear
145 | - The service layer helps us clearly define our use cases and the workflow for each
146 | - Tests: help us change our system fearlessly
147 | - Don't write too many tests against the domain model: when you need to change the codebase you may need to update several unit tests
148 | - Testing against the service layer: tests don't interact directly with "private" methods/attributes on our model objects = easier to refactor them
149 | 
150 | > "Every line of code that we put in a test is like a **blob of glue**, holding the system in a particular shape. The more low-level tests we have, the harder it will be to change things."
151 | 
152 | - **"Listen to the code"**: when writing tests, finding that the code is hard to use or noticing a code smell is a trigger to refactor and reconsider the design
153 | - To improve the design of the code we must delete "sketch" tests that are too tightly coupled to a particular implementation
154 | 
155 | ## High and low gear
156 | - When starting a new project or a gnarly problem: write tests against the domain model = better feedback
157 | - When adding a new feature or fixing a bug (no need for extensive changes to the domain model): write tests against the services = lower coupling and higher coverage
158 | - **Shifting gears metaphor**
159 | - Mitigation: keep all domain dependencies in fixture functions
160 | 
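To tie the ideas from Chapters 2-5 together, here is a minimal, hypothetical sketch (not code from the book; `Order`, `FakeRepository`, and `add_line` are invented for illustration) of a service-layer function that depends only on a duck-typed repository, plus a test that drives it through an in-memory fake rather than through the domain model's internals:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Order:
    reference: str
    lines: List[str] = field(default_factory=list)


class AbstractRepository(Protocol):
    # To a Pythonista, any object with add() and get() is a repository (Ch2).
    def add(self, order: Order) -> None: ...
    def get(self, reference: str) -> Order: ...


def add_line(reference: str, sku: str, repo: AbstractRepository) -> None:
    """A typical service function (Ch4): fetch, check, act, save."""
    order = repo.get(reference)
    if sku in order.lines:
        raise ValueError(f"{sku} is already on order {reference}")
    order.lines.append(sku)
    repo.add(order)


class FakeRepository:
    """In-memory fake used by service-layer tests (Ch5)."""

    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    def add(self, order: Order) -> None:
        self._orders[order.reference] = order

    def get(self, reference: str) -> Order:
        return self._orders[reference]


def test_add_line_through_the_service_layer():
    repo = FakeRepository()
    repo.add(Order("order-001"))
    add_line("order-001", "RED-CHAIR", repo)
    assert "RED-CHAIR" in repo.get("order-001").lines
```

The test never reaches into the domain object's "private" attributes; it exercises the use case through the service function, which is the lower-coupling style of test described above.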
161 | # Ch6. Unit of Work pattern
162 | - Abstraction over the idea of *atomic operations*
163 | - Allows us to fully decouple our service layer from the data layer
164 | - We can implement it using Python's **context manager**
165 | - Context manager: `__enter__` and `__exit__` are the two magic methods that execute when we enter the `with` block and when we exit it, respectively = setup and teardown phases
166 | 
167 | > **Don't mock what you don't own**: rule of thumb that forces us to build simple abstractions over messy subsystems
168 | 
169 | # Ch7. Aggregates and consistency boundaries
170 | ## Why not just run everything in a spreadsheet?
171 | **"CSV over SMTP"** architecture:
172 | - Low initial complexity
173 | - Does not scale very well
174 | - Difficult to apply logic and maintain consistency
175 | 
176 | ## Invariants, constraints and consistency
177 | - **Constraint**: rule that restricts the possible states our model can get into
178 | - **Invariant**: defined a little more precisely as a condition that is always true
179 | - **Locks**: prevent two operations from happening simultaneously on the same row or same table
180 | 
181 | ## What is an Aggregate?
182 | - A domain object that contains other domain objects and lets us treat the whole collection as a single unit
183 | - The only way to modify the objects inside the aggregate is to load the whole thing, and to call methods on the aggregate itself
184 | 
185 | > "An Aggregate is a cluster of associated objects that we treat as a unit for the purpose of data changes" - Eric Evans, DDD blue book
186 | 
187 | ## Choosing an Aggregate
188 | - Draw a boundary around a small number of objects (the smaller, the better for performance)
189 | - The objects inside have to be consistent with one another
190 | - Give it a good name
191 | 
192 | > **Bounded contexts**: reaction against attempts to capture entire businesses into a single model
193 | 
194 | ## One Aggregate = One Repository
195 | - **Repositories should only return aggregates**
196 | - Aggregates are the only entities that are publicly accessible to the outside world
197 | 
198 | ## Optimistic concurrency
199 | - Version numbers are one way to implement it
200 | - Another implementation: setting the Postgres transaction isolation level to SERIALIZABLE (severe performance cost)
201 | - **Optimistic** = default assumption is that everything will be fine when two users want to make changes to the database -> you need to explicitly handle the possibility of failures in the case of a clash (retry the failed operation from the beginning)
202 | - **Pessimistic** concurrency = assumption that two users are going to cause conflicts -> prevent conflicts in all cases -> lock everything just to be safe -> you don't need to think about handling failures because the database will prevent them for you (you do need to think about deadlocks)
203 | 
204 | # Part II. Event-Driven Architecture
--------------------------------------------------------------------------------
/clean-code/clean_code.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/clean-code/clean_code.jpeg
--------------------------------------------------------------------------------
/clean-code/notes.md:
--------------------------------------------------------------------------------
1 | # Clean Code: A Handbook of Agile Software Craftsmanship
2 | 
3 | Author: Robert C.
Martin 4 | 5 | [Available here](https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882) 6 | 7 | ![clean_code](clean_code.jpeg) 8 | 9 | - [Clean Code: A Handbook of Agile Software Craftsmanship](#clean-code-a-handbook-of-agile-software-craftsmanship) 10 | - [Ch1. Clean Code](#ch1-clean-code) 11 | - [Reading vs. Writing](#reading-vs-writing) 12 | - [Ch2. Meaningful Names](#ch2-meaningful-names) 13 | - [Use intention-revealing names](#use-intention-revealing-names) 14 | - [Avoid disinformation](#avoid-disinformation) 15 | - [Make meaningful distinctions](#make-meaningful-distinctions) 16 | - [Use pronounceable names](#use-pronounceable-names) 17 | - [Use searchable names](#use-searchable-names) 18 | - [Avoid encodings](#avoid-encodings) 19 | - [Avoid mental mapping](#avoid-mental-mapping) 20 | - [Class names](#class-names) 21 | - [Method names](#method-names) 22 | - [Don't be cute](#dont-be-cute) 23 | - [Pick one word per concept](#pick-one-word-per-concept) 24 | - [Don't pun](#dont-pun) 25 | - [Use solution domain names](#use-solution-domain-names) 26 | - [Use problem domain names](#use-problem-domain-names) 27 | - [Add meaningful context](#add-meaningful-context) 28 | - [Don't add gratuitous context](#dont-add-gratuitous-context) 29 | - [Ch3. Functions](#ch3-functions) 30 | - [Small](#small) 31 | - [Blocks and Indenting](#blocks-and-indenting) 32 | - [Do one thing](#do-one-thing) 33 | - [One level of abstraction per function](#one-level-of-abstraction-per-function) 34 | - [The Stepdown rule](#the-stepdown-rule) 35 | - [Use descriptive names](#use-descriptive-names) 36 | - [Function arguments](#function-arguments) 37 | - [How do you write functions like this?](#how-do-you-write-functions-like-this) 38 | - [Ch4. Comments](#ch4-comments) 39 | - [Ch5. Formatting](#ch5-formatting) 40 | - [The newspaper metaphor](#the-newspaper-metaphor) 41 | - [Vertical formatting](#vertical-formatting) 42 | - [Horizontal formatting](#horizontal-formatting) 43 | - [Ch6. Objects and Data Structures](#ch6-objects-and-data-structures) 44 | - [Data/Object anti-symmetry](#dataobject-anti-symmetry) 45 | - [Data transfer objects (DTO)](#data-transfer-objects-dto) 46 | - [Active records](#active-records) 47 | - [Objects](#objects) 48 | - [Data Structures](#data-structures) 49 | - [Ch7. Error Handling](#ch7-error-handling) 50 | - [Write your `Try-Catch-Finally` statement first](#write-your-try-catch-finally-statement-first) 51 | - [Provide context with exceptions](#provide-context-with-exceptions) 52 | - [Define the normal flow](#define-the-normal-flow) 53 | - [Ch8. Boundaries](#ch8-boundaries) 54 | - [Ch9. Unit Tests](#ch9-unit-tests) 55 | - [The three laws of TDD](#the-three-laws-of-tdd) 56 | - [Keeping tests clean](#keeping-tests-clean) 57 | - [Clean tests](#clean-tests) 58 | - [F.I.R.S.T.](#first) 59 | - [Ch10. Classes](#ch10-classes) 60 | - [The Single Responsibility Principle](#the-single-responsibility-principle) 61 | - [Cohesion](#cohesion) 62 | - [Organizing for change](#organizing-for-change) 63 | - [Ch11. Systems](#ch11-systems) 64 | - [Separate constructing a system from using it](#separate-constructing-a-system-from-using-it) 65 | - [Separation of main](#separation-of-main) 66 | - [Factories](#factories) 67 | - [Dependency injection (DI)](#dependency-injection-di) 68 | - [Scaling up](#scaling-up) 69 | - [Test drive the system architecture](#test-drive-the-system-architecture) 70 | - [Optimize decision making](#optimize-decision-making) 71 | - [Ch12. 
Emergence](#ch12-emergence) 72 | - [Simple design rule 1: runs all the tests](#simple-design-rule-1-runs-all-the-tests) 73 | - [Simple design rule 2-4: refactoring](#simple-design-rule-2-4-refactoring) 74 | - [No duplication](#no-duplication) 75 | - [Expressive](#expressive) 76 | - [Minimal classes and methods](#minimal-classes-and-methods) 77 | - [Ch13. Concurrency](#ch13-concurrency) 78 | - [Why concurrency?](#why-concurrency) 79 | - [Myths and misconceptions](#myths-and-misconceptions) 80 | - [Concurrency defense principles](#concurrency-defense-principles) 81 | - [Know your execution models](#know-your-execution-models) 82 | - [Others](#others) 83 | 84 | # Ch1. Clean Code 85 | > "Most managers want good code, even when they are obsessing about the schedule (...) It's *your* job to defend the code with equal passion" 86 | 87 | - Clean code is *focused*: each function, each class, each module exposes a single-minded attitude that remains entirely undistracted, and upolluted, by the surrounding details 88 | - Code, without tests, is not clean. No matter how elegant it is, no matter how readable and accessible, if it hath not tests, it be unclean 89 | - You will read it, and it will be pretty much what you expected. It will be obvious, simple, and compelling 90 | 91 | ## Reading vs. Writing 92 | - The ratio of time spent reading vs. writing is well over 10:1 93 | - We are constantly reading old code as part of the effort to write new code 94 | - **We want the reading of code to be easy, even if it makes the writing harder** 95 | - You cannot write code if you cannot read the surrounding code 96 | - If you want to go fast, get done quickly, if you want your code to be easy to write, make it easy to read 97 | 98 | # Ch2. Meaningful Names 99 | 100 | ## Use intention-revealing names 101 | > Choosing good names takes time, but saves more than it takes. Take care with your names and change them when you find better ones 102 | 103 | ## Avoid disinformation 104 | - Avoid leaving false clues that obscure the meaning of code 105 | - Avoid words whose entrenched meanings vary from our intended meaning 106 | 107 | ## Make meaningful distinctions 108 | If names must be different, then they should also mean something different 109 | 110 | ## Use pronounceable names 111 | - Humans are good at words 112 | - Words are, by definition, pronounceable 113 | 114 | ## Use searchable names 115 | Single-letter names and numeric constants have a particular problem in that they are not easy to locate across a body of text 116 | 117 | ## Avoid encodings 118 | Encoding type or scope information into names simply adds an extra burden of deciphering 119 | 120 | ## Avoid mental mapping 121 | > Clarity is king 122 | 123 | ## Class names 124 | - Classes and objects should have noun or noun phrase names 125 | - A class name should not be a verb 126 | 127 | ## Method names 128 | Methods should have verb or verb phrase names 129 | 130 | ## Don't be cute 131 | - Choose clarity over entertainment value 132 | - Say what you mean. 
Mean what you say
133 | 
134 | ## Pick one word per concept
135 | A consistent lexicon is a great boon to the programmers who must use your code
136 | 
137 | ## Don't pun
138 | Avoid using the same word for two purposes -> essentially a pun
139 | 
140 | ## Use solution domain names
141 | - People who read your code will be programmers
142 | - Use CS terms, algorithm names, pattern names, math terms
143 | 
144 | ## Use problem domain names
145 | - Separate solution and problem domain concepts
146 | - Code that has more to do with problem domain concepts should have names drawn from the problem domain
147 | 
148 | ## Add meaningful context
149 | Most names are not meaningful in and of themselves
150 | 
151 | ## Don't add gratuitous context
152 | - Shorter names are generally better than long ones, so long as they are clear
153 | - Add no more context to a name than is necessary
154 | 
155 | > Choosing good names requires good descriptive skills and a shared cultural background. This is a teaching issue rather than a technical, business, or management issue
156 | 
157 | # Ch3. Functions
158 | ## Small
159 | Functions should be small
160 | 
161 | ### Blocks and Indenting
162 | - Blocks within `if` statements, `else` statements, `while` statements should be one line long -> probably a function call
163 | - Keeping the enclosing function small adds documentary value
164 | - Functions should not be large enough to hold nested structures -> makes them easier to read and understand
165 | 
166 | ## Do one thing
167 | > Functions should do one thing. They should do it well. They should do it only
168 | 
169 | - Reasons to write functions: decompose a larger concept (the name of the function) into a set of steps at the next level of abstraction
170 | - Functions that do one thing cannot be divided into sections
171 | 
172 | ## One level of abstraction per function
173 | - Once details are mixed with essential concepts, more details tend to accrete within the function
174 | 
175 | ### The Stepdown rule
176 | - We want code to be read like a top-down narrative
177 | - A set of TO paragraphs, each describing the current level of abstraction and referencing subsequent TO paragraphs at the next level down
178 | 
179 | ## Use descriptive names
180 | Ward's principle: *"You know you are working on clean code when each routine turns out to be pretty much what you expected"*
181 | 
182 | - Spend time choosing a name
183 | - You should try several different names and read the code with each in place
184 | 
185 | ## Function arguments
186 | Ideal number of arguments for a function:
187 | - zero (niladic)
188 | - one (monadic)
189 | - two (dyadic)
190 | - more than that should be avoided where possible
191 | 
192 | - **Arguments are hard from a testing point of view** -> test cases for all combinations of arguments
193 | - Output arguments are harder to understand than input arguments
194 | - **Passing a boolean into a function (flag arguments) is a terrible practice** -> loudly proclaims that this function does more than one thing -> it does one thing if the flag is true and another if the flag is false! (see the sketch after this list)
195 | - When a function seems to need more than two or three arguments, it is likely that some of those arguments ought to be wrapped into a class of their own -> when groups of variables are passed together, they are likely part of a concept that deserves a name of its own
196 | - Side effects are lies -> your function promises to do one thing, but it also does other *hidden* things
197 | - Either your function should change the state of an object, or it should return some information about the object
198 | - **Prefer exceptions to returning error codes**
199 | - Extract try/catch blocks into functions of their own
200 | - Functions should do one thing -> error handling is one thing
201 | - Don't repeat yourself -> duplication may be the root of all evil in software
202 | 
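The flag-argument point is easiest to see in code. Below is a small, hypothetical Python sketch loosely mirroring the book's `render(isSuite)` illustration (the book's code is in Java; the names here are made up for this example):

```python
# A flag argument makes one function do two things.
def render(page: str, is_suite: bool) -> str:
    if is_suite:
        return f"<suite>{page}</suite>"
    return f"<page>{page}</page>"


# Splitting it gives each caller a single-purpose, honestly named function.
def render_for_suite(page: str) -> str:
    return f"<suite>{page}</suite>"


def render_single_page(page: str) -> str:
    return f"<page>{page}</page>"
```

Each caller now states its intent at the call site instead of passing a boolean whose meaning has to be looked up.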
203 | ## How do you write functions like this?
204 | Writing software is like any other kind of writing
205 | 1. Get your thoughts down first
206 | 2. Massage it until it reads well
207 | 
208 | The first draft might be clumsy and disorganized, so you restructure it and refine it until it reads the way you want it to read
209 | 
210 | > Every system is built from a domain-specific language designed by the programmers to describe the system. Functions are the verbs of that language, and classes are the nouns.
211 | 
212 | # Ch4. Comments
213 | - Comments are always failures. We must have them because we cannot always figure out how to express ourselves without them, but their use is not a cause for celebration
214 | - Comments lie. Not always, and not intentionally, but too often
215 | - The older a comment is, and the farther away it is from the code it describes, the more likely it is to be wrong
216 | - **Truth can only be found in the code**
217 | - Explain your intent in code: **create a function that says the same thing as the comment you want to write**
218 | - A comment may be used to amplify the importance of something that may otherwise seem inconsequential
219 | - We have good source code control systems now. Those systems will remember the code for us. We don't have to comment it out any more. Just delete the code
220 | - Short functions don't need much description -> a well-chosen name for a small function that does one thing is better than a comment header
221 | 
222 | # Ch5. Formatting
223 | Code formatting:
224 | - Too important to ignore
225 | - Is about communication -> the developer's first order of business
226 | 
227 | > Small files are easier to understand than large files are
228 | 
229 | ## The newspaper metaphor
230 | A source file should be like a newspaper article
231 | - The name should be simple but explanatory
232 | - The name, by itself, should be sufficient to tell us whether we are in the right module or not
233 | 
234 | ## Vertical formatting
235 | - Avoid forcing the reader to hop around through the source files and classes
236 | - **Dependent functions**: if one function calls another, they should be vertically close, and the caller should be above the callee
237 | 
238 | ## Horizontal formatting
239 | - Strive to keep your lines short
240 | - Going beyond 100-120 characters isn't advisable
241 | 
242 | # Ch6. Objects and Data Structures
243 | 
244 | ## Data/Object anti-symmetry
245 | > Objects hide their data behind abstractions and expose functions that operate on that data. Data structures expose their data and have no meaningful functions
246 | 
247 | - Procedural code (code using data structures) makes it easy to add new functions without changing the existing data structures. OO code makes it easy to add new classes without changing existing functions
248 | - Procedural code makes it hard to add new data structures because all the functions must change. OO code makes it hard to add new functions because all the classes must change
249 | 
250 | > Mature programmers know that the idea that *everything is an object* is a myth. Sometimes you really do want simple data structures with procedures operating on them
251 | 
252 | ## Data transfer objects (DTO)
253 | DTO: quintessential form of a data structure -> a class with public variables and no functions
254 | 
255 | ### Active records
256 | - Special forms of DTOs
257 | - Data structures with public (or bean-accessed) variables; but they typically have navigational methods like `save` and `find`
258 | 
259 | ## Objects
260 | - expose behavior and hide data
261 | - easy to add new kinds of objects without changing existing behaviors
262 | - hard to add new behaviors to existing objects
263 | 
264 | ## Data Structures
265 | - expose data and have no significant behavior
266 | - easy to add new behaviors to existing data structures
267 | - hard to add new data structures to existing functions
268 | 
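To make the anti-symmetry concrete, here is a hypothetical Python sketch (the book's examples use Java shape classes; these names are illustrative). The procedural version makes adding a new function cheap and a new shape expensive; the object-oriented version inverts that trade-off.

```python
import math
from dataclasses import dataclass


# Procedural style: data structures expose data; functions operate on them.
@dataclass
class Square:
    side: float


@dataclass
class Circle:
    radius: float


def area(shape) -> float:
    # Adding perimeter() later is easy; adding Triangle means editing every function.
    if isinstance(shape, Square):
        return shape.side ** 2
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    raise TypeError(f"unknown shape: {shape!r}")


# OO style: objects hide data and expose behavior.
class OOSquare:
    def __init__(self, side: float) -> None:
        self._side = side

    def area(self) -> float:
        # Adding OOTriangle later is easy; adding perimeter() means editing every class.
        return self._side ** 2


class OOCircle:
    def __init__(self, radius: float) -> None:
        self._radius = radius

    def area(self) -> float:
        return math.pi * self._radius ** 2
```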
269 | # Ch7. Error Handling
270 | > Things can go wrong, and when they do, we as programmers are responsible for making sure that our code does what it needs to do
271 | 
272 | - Error handling is important, but if it obscures logic, it's wrong
273 | - It is better to throw an exception when you encounter an error. The calling code is cleaner. Its logic is not obscured by error handling
274 | 
275 | ## Write your `Try-Catch-Finally` statement first
276 | - `try` blocks are like transactions
277 | - Your `catch` has to leave your program in a consistent state, no matter what happens in the `try`
278 | - Try to write tests that force exceptions, and then add behavior to your handler to satisfy your tests -> this causes you to build the transaction scope of the `try` block first and helps maintain the transactional nature of that scope
279 | 
280 | ## Provide context with exceptions
281 | - Create informative error messages and pass them along with your exceptions
282 | - Mention the operation that failed and the type of failure
283 | - If you are logging in your application, pass along enough information to be able to log the error in your `catch`
284 | 
285 | > Wrapping third-party APIs is a best practice -> it minimizes your dependencies upon them: you can choose to move to a different library in the future without much penalty, and it makes it easier to mock out third-party calls when you are testing your own code
286 | 
287 | ## Define the normal flow
288 | **Special case pattern**: you create a class or configure an object so that it handles a special case for you -> the client code doesn't have to deal with exceptional behavior
289 | 
290 | # Ch8. Boundaries
291 | - It's not our job to test the third-party code, but it may be in our best interest to write tests for the third-party code we use
292 | - **Learning tests**: call the third-party API as we expect to use it in our application -> controlled experiments that check our understanding of that API
293 | - **Clean Boundaries**: code at the boundaries needs clear separation and tests that define expectations
294 | 
295 | > Avoid letting too much of our code know about the third-party particulars. It's better to depend on something you control than on something you don't control, lest it end up controlling you
296 | 
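As a quick illustration of the Special Case pattern from Ch7, here is a hedged Python sketch loosely modeled on the book's per-diem meal-expenses example (the book's version is in Java; the class names, the DAO shape, and the per-diem amount below are assumptions for illustration). The data-access object returns a special-case object instead of raising, so the client code follows the normal flow.

```python
from dataclasses import dataclass


@dataclass
class MealExpenses:
    total: float

    def get_total(self) -> float:
        return self.total


class PerDiemMealExpenses:
    """Special case returned when no expenses were recorded."""

    def get_total(self) -> float:
        return 25.0  # illustrative per-diem default


class ExpenseReportDAO:
    def __init__(self, records: dict) -> None:
        self._records = records

    def get_meals(self, employee_id: str):
        # The DAO, not every caller, decides what "missing" means.
        total = self._records.get(employee_id)
        return MealExpenses(total) if total is not None else PerDiemMealExpenses()


dao = ExpenseReportDAO({"emp-1": 42.0})
print(dao.get_meals("emp-2").get_total())  # 25.0 -- no try/except in the client code
```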
297 | # Ch9. Unit Tests
298 | ## The three laws of TDD
299 | - **First Law**: You may not write production code until you have written a failing unit test
300 | - **Second Law**: You may not write more of a unit test than is sufficient to fail, and not compiling is failing
301 | - **Third Law**: You may not write more production code than is sufficient to pass the current failing test
302 | 
303 | ## Keeping tests clean
304 | - Having dirty tests is equivalent to, if not worse than, having no tests
305 | - Tests must change as the production code evolves -> the dirtier the tests, the harder they are to change
306 | - If your tests are dirty, you begin to lose the ability to improve the structure of that code
307 | > Test code is just as important as production code. It requires thought, design, and care. It must be kept as clean as production code
308 | 
309 | ## Clean tests
310 | Readability is perhaps even more important in unit tests than it is in production code:
311 | - Clarity
312 | - Simplicity
313 | - Density of expression (say a lot with as few expressions as possible)
314 | 
315 | **BUILD-OPERATE-CHECK** pattern:
316 | - First part builds up the test data
317 | - Second part operates on that test data
318 | - Third part checks that the operation yielded the expected results
319 | 
320 | **Domain-Specific Testing Language**: a testing language (specialized API used by the tests) -> makes tests expressive and succinct -> makes the tests more convenient to write and easier to read
321 | 
322 | **given-when-then** convention: makes the tests even easier to read
323 | 
324 | **TEMPLATE METHOD** pattern -> putting the given/when parts in the base class, and the then parts in different derivatives
325 | 
326 | - The number of asserts in a test ought to be minimized
327 | - We want to test a single concept in each test function
328 | 
329 | ## F.I.R.S.T.
330 | - **Fast**: when tests run slow, you won't want to run them frequently
331 | - **Independent**: you should be able to run each test independently and run the tests in any order you like
332 | - **Repeatable**: if your tests aren't repeatable in any environment, then you'll always have an excuse for why they fail
333 | - **Self-Validating**: you should not have to read through a log file to tell whether the tests pass (should have a boolean output -> pass/fail)
334 | - **Timely**: unit tests should be written just before the production code that makes them pass
335 | 
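A minimal pytest-style sketch of BUILD-OPERATE-CHECK combined with the given-when-then convention (hypothetical class and test names; the book's examples are in Java):

```python
class ShoppingCart:
    def __init__(self) -> None:
        self._items = {}

    def add(self, sku: str, qty: int = 1) -> None:
        self._items[sku] = self._items.get(sku, 0) + qty

    def count(self) -> int:
        return sum(self._items.values())


def test_adding_an_item_increases_the_count():
    # Given (build): an empty cart
    cart = ShoppingCart()

    # When (operate): one item is added
    cart.add("BLUE-LAMP")

    # Then (check): a single concept, a minimal number of asserts
    assert cart.count() == 1
```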
336 | # Ch10. Classes
337 | - Smaller is the primary rule when it comes to designing classes
338 | - The name of a class should describe what responsibilities it fulfills
339 | - If we cannot derive a concise name for a class, then it's likely too large -> the more ambiguous the class name, the more likely it has too many responsibilities
340 | 
341 | ## The Single Responsibility Principle
342 | - **SRP** is one of the more important concepts in OO design
343 | - States that a class or module should have one, and only one, *reason to change*
344 | - Gives us a definition of responsibility
345 | - Gives us a guideline for class size
346 | - A system with many small classes has no more moving parts than a system with a few large classes
347 | 
348 | > Trying to identify responsibilities (reasons to change) often helps us recognize and create better abstractions in our code
349 | 
350 | ## Cohesion
351 | - Classes should have a small number of instance variables
352 | - Each of the methods of a class should manipulate one or more of those variables
353 | - A class in which each variable is used by each method is **maximally cohesive**
354 | - Maintaining cohesion results in many small classes
355 | 
356 | ## Organizing for change
357 | - Change is continual
358 | - Every change -> risk that the remainder of the system no longer works as intended
359 | - Clean system -> organize our classes to reduce the risk of change
360 | 
361 | > **Open-Closed Principle (OCP)**: another key OO class design principle -> classes should be open for extension but closed for modification
362 | 
363 | - Ideal system -> we incorporate new features by extending the system, not by making modifications to existing code
364 | 
365 | > **Dependency Inversion Principle (DIP)** -> classes should depend upon abstractions, not on concrete details
366 | 
367 | # Ch11. Systems
368 | ## Separate constructing a system from using it
369 | > Software systems should separate the startup process, when the application objects are constructed and the dependencies are "wired" together, from the runtime logic that takes over after startup
370 | 
371 | - Startup process: a *concern* that any application must address
372 | - *Separation of concerns*: one of the most important design techniques
373 | - Never let little, convenient idioms lead to modularity breakdown
374 | 
375 | ## Separation of main
376 | ### Factories
377 | - **ABSTRACT FACTORY** pattern -> gives the application control of *when* to build the object, but keeps the details of that construction separate from the application code
378 | 
379 | ### Dependency injection (DI)
380 | - Powerful mechanism for separating construction from use
381 | - Application of *Inversion of Control* (IoC) to dependency management
382 | - Moves secondary responsibilities from an object to other objects that are dedicated to the purpose (supporting SRP)
383 | - The invoking object doesn't control what kind of object is actually returned, but it still actively resolves the dependency
384 | > An object should not take responsibility for instantiating dependencies itself. Instead, it should pass this responsibility to another "authoritative" mechanism (inverting control).
Setup is a global concern, this authoritative mechanism will be either the "main" routine or a special-purpose container 385 | 386 | ## Scaling up 387 | - **Myth**: we can get systems "right the first time" 388 | - Implement only today's stories -> then refactor and expand the system to implement new stories tomorrow = essence of iterative and incremental agility 389 | - TDD, refactoring, and the clean code they produce make this work at the code level 390 | - Software systems are unique compared to physical systems. Their archiectures can grow incrementally, **if we maintain the proper separation of concerns** 391 | 392 | ## Test drive the system architecture 393 | - **Big Design Up Front (BDUF)**: harmful because it inhibits adapting to change, due to psychological resistance to discarding prior effort and because of the way architecture choices influence subsequent thinking about the design 394 | 395 | ## Optimize decision making 396 | - Modularity and separation of concerns make decentralized management and decision making possible 397 | - Give responsibilities to the most qualified persons 398 | - **It is best to postpone decisions until the last possible moment** -> lets us make informed choices with the best possible information. A premature decision is a decision made with suboptimal knowledge 399 | 400 | > Whether you are designing systems or individual modules, never forget to use **the simplest thing that can possibly work** 401 | 402 | # Ch12. Emergence 403 | A design is "simple", if it follows these rules: 404 | - Run all the tests 405 | - Contains no duplication 406 | - Expresses the intent of the programmer 407 | - Minimizes the number of classes and methods 408 | 409 | ## Simple design rule 1: runs all the tests 410 | - Systems that aren't testable aren't verifiable 411 | - A system that cannot be verified should never be deployed 412 | - Tight coupling makes it difficult to write tests 413 | - The more tests we write, the more we use principles like DIP and tools like dependency injection, interfaces, and abstraction to minimize coupling -> our designs improve even more 414 | - Primary OO goals -> low coupling and high cohesion 415 | 416 | ## Simple design rule 2-4: refactoring 417 | For each few lines of code we add, we pause and reflect on the new design 418 | 419 | ### No duplication 420 | - Duplication is the primary enemy of a well-designed system 421 | - It represents additional work, additional risk, and additional unnecessary complexity 422 | - TEMPLATE METHOD pattern: common technique for removing higher-level duplication 423 | 424 | ### Expressive 425 | > It's easy to write code that *we* understand, because at the time we write it we're deep in an understanding of the problem we're trying to solve. Other maintainers of the code aren't going to have so deep an understanding 426 | 427 | - Choose good names 428 | - Keep your functions and classes small 429 | - Use standard nomenclature 430 | - Tests primary goal = act as documentation by example 431 | - **The most important way to be expressive is to try. Care is a precious resource** 432 | 433 | ### Minimal classes and methods 434 | - Effort to make our classes and methods small -> we might create too many tiny classes and methods -> **also keep our function and class counts low!** 435 | 436 | > Although it's important to keep class and function count low, it's more important to have tests, eliminate duplication, and express yourself 437 | 438 | # Ch13. Concurrency 439 | > Objects are abstractions of processing. 
Threads are abstractions of schedule - James O. Coplien 440 | 441 | ## Why concurrency? 442 | - Concurrency is a decoupling strategy 443 | - Helps us decouple **what** gets done from **when** it gets done 444 | 445 | ## Myths and misconceptions 446 | - Concurrency can *sometimes* improve performance, but only when there is a lot of wait time that can be shared between multiple threads or multiple processors 447 | - The design of a concurrent algorithm can be remarkably different from the design of a single-threaded system 448 | - **Concurrency bugs aren't usually repeatable**, so they are often ignored as one-offs instead of the true defects they are 449 | - Concurrency often requires a fundamental change in design strategy 450 | 451 | ## Concurrency defense principles 452 | - **Single responsibility principle**: keep your concurrency-related code separate from other code 453 | - **Limit the scope of data**: data encapsulation; severely limit the access of any data that may be shared 454 | - **Use copies of data** 455 | - **Threads should be as independent as possible** 456 | 457 | ### Know your execution models 458 | - **Producer-Consumer** 459 | - **Readers-Writers** 460 | - **Dining Philosophers** 461 | 462 | ### Others 463 | - Keep synchronized sections small 464 | - Think about shut-down early and get it working early 465 | - Write tests that have the potential to expose problems and then run them frequently, with different programatic configurations and system configurations and load 466 | - Do not ignore system failures as one-offs 467 | - Do not try to chase down nonthreading bugs and threading bugs at the same time. Make sure your code works outside of threads 468 | - Make your thread-based code especially pluggable so that you can run it in various configurations 469 | - Run your threaded code on all target platforms early and often 470 | 471 | > Code that is simple to follow can become nightmarish when multiple threads and shared data get into the mix -> you need to write clean code with rigor or else face subtle and infrequent failures 472 | -------------------------------------------------------------------------------- /examples/fp_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Functional Programming with Python\n", 8 | "\n", 9 | "## Problem -> Parse csv\n", 10 | "input\n", 11 | "```\n", 12 | "csv = \"\"\"firstName;lastName\n", 13 | "Ada;Lovelace\n", 14 | "Emmy;Noether\n", 15 | "Marie;Curie\n", 16 | "Tu;Youyou\n", 17 | "Ada;Yonath\n", 18 | "Vera;Rubin\n", 19 | "Sally;Ride\"\"\"\n", 20 | "```\n", 21 | "\n", 22 | "output\n", 23 | "```\n", 24 | "target = [{'firstName': 'Ada', 'lastName': 'Lovelace'},\n", 25 | " {'firstName': 'Emmy', 'lastName': 'Noether'},\n", 26 | " ...]\n", 27 | "```\n", 28 | "\n", 29 | "[source](https://www.youtube.com/watch?v=r2eZ7lhqzNE)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "csv = \"\"\"firstName;lastName\n", 39 | "Ada;Lovelace\n", 40 | "Emmy;Noether\n", 41 | "Marie;Curie\n", 42 | "Tu;Youyou\n", 43 | "Ada;Yonath\n", 44 | "Vera;Rubin\n", 45 | "Sally;Ride\"\"\"" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Imperative Python" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 2, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | 
"lines = csv.split('\\n')\n", 62 | "matrix = [line.split(';') for line in lines]\n", 63 | "header = matrix.pop(0)\n", 64 | "records = []\n", 65 | "for row in matrix:\n", 66 | " record = {}\n", 67 | " for index, key in enumerate(header):\n", 68 | " record[key] = row[index]\n", 69 | " records.append(record)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "data": { 79 | "text/plain": [ 80 | "[{'firstName': 'Ada', 'lastName': 'Lovelace'},\n", 81 | " {'firstName': 'Emmy', 'lastName': 'Noether'},\n", 82 | " {'firstName': 'Marie', 'lastName': 'Curie'},\n", 83 | " {'firstName': 'Tu', 'lastName': 'Youyou'},\n", 84 | " {'firstName': 'Ada', 'lastName': 'Yonath'},\n", 85 | " {'firstName': 'Vera', 'lastName': 'Rubin'},\n", 86 | " {'firstName': 'Sally', 'lastName': 'Ride'}]" 87 | ] 88 | }, 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "records" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "### Functional Python" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 4, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "from toolz.curried import compose, map\n", 112 | "from functools import partial\n", 113 | "from operator import methodcaller\n", 114 | "\n", 115 | "split = partial(methodcaller, 'split')\n", 116 | "split_lines = split('\\n')\n", 117 | "split_fields = split(';')\n", 118 | "dict_from_keys_vals = compose(dict, zip)\n", 119 | "csv_to_matrix = compose(map(split_fields), split_lines)\n", 120 | "\n", 121 | "matrix = csv_to_matrix(csv)\n", 122 | "keys = next(matrix)\n", 123 | "records = map(partial(dict_from_keys_vals, keys), matrix)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "" 135 | ] 136 | }, 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "records" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "data": { 153 | "text/plain": [ 154 | "[{'firstName': 'Ada', 'lastName': 'Lovelace'},\n", 155 | " {'firstName': 'Emmy', 'lastName': 'Noether'},\n", 156 | " {'firstName': 'Marie', 'lastName': 'Curie'},\n", 157 | " {'firstName': 'Tu', 'lastName': 'Youyou'},\n", 158 | " {'firstName': 'Ada', 'lastName': 'Yonath'},\n", 159 | " {'firstName': 'Vera', 'lastName': 'Rubin'},\n", 160 | " {'firstName': 'Sally', 'lastName': 'Ride'}]" 161 | ] 162 | }, 163 | "execution_count": 6, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "list(records)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 30, 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "{'firstName': ['firstName', 'lastName'], 'lastName': ['Ada', 'Lovelace']}" 181 | ] 182 | }, 183 | "execution_count": 30, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "from toolz.curried import pipe\n", 190 | "\n", 191 | "pipe(csv, csv_to_matrix, partial(dict_from_keys_vals, keys))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 15, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "data": { 201 | "text/plain": [ 202 | 
"\u001b[1;31mSignature:\u001b[0m \u001b[0mpipe\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0mfuncs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 203 | "\u001b[1;31mDocstring:\u001b[0m\n", 204 | "Pipe a value through a sequence of functions\n", 205 | "\n", 206 | "I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``\n", 207 | "\n", 208 | "We think of the value as progressing through a pipe of several\n", 209 | "transformations, much like pipes in UNIX\n", 210 | "\n", 211 | "``$ cat data | f | g | h``\n", 212 | "\n", 213 | ">>> double = lambda i: 2 * i\n", 214 | ">>> pipe(3, double, str)\n", 215 | "'6'\n", 216 | "\n", 217 | "See Also:\n", 218 | " compose\n", 219 | " compose_left\n", 220 | " thread_first\n", 221 | " thread_last\n", 222 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\toolz\\functoolz.py\n", 223 | "\u001b[1;31mType:\u001b[0m function\n" 224 | ] 225 | }, 226 | "metadata": {}, 227 | "output_type": "display_data" 228 | } 229 | ], 230 | "source": [ 231 | "?pipe" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 12, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/plain": [ 242 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mpartial\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 243 | "\u001b[1;31mDocstring:\u001b[0m \n", 244 | "partial(func, *args, **keywords) - new function with partial application\n", 245 | "of the given arguments and keywords.\n", 246 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\functools.py\n", 247 | "\u001b[1;31mType:\u001b[0m type\n", 248 | "\u001b[1;31mSubclasses:\u001b[0m \n" 249 | ] 250 | }, 251 | "metadata": {}, 252 | "output_type": "display_data" 253 | } 254 | ], 255 | "source": [ 256 | "?partial" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 7, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "data": { 266 | "text/plain": [ 267 | "\u001b[1;31mSignature:\u001b[0m \u001b[0mcompose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0mfuncs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 268 | "\u001b[1;31mDocstring:\u001b[0m\n", 269 | "Compose functions to operate in series.\n", 270 | "\n", 271 | "Returns a function that applies other functions in sequence.\n", 272 | "\n", 273 | "Functions are applied from right to left so that\n", 274 | "``compose(f, g, h)(x, y)`` is the same as ``f(g(h(x, y)))``.\n", 275 | "\n", 276 | "If no arguments are provided, the identity function (f(x) = x) is returned.\n", 277 | "\n", 278 | ">>> inc = lambda i: i + 1\n", 279 | ">>> compose(str, inc)(3)\n", 280 | "'4'\n", 281 | "\n", 282 | "See Also:\n", 283 | " compose_left\n", 284 | " pipe\n", 285 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\toolz\\functoolz.py\n", 286 | "\u001b[1;31mType:\u001b[0m function\n" 287 | ] 288 | }, 289 | "metadata": {}, 290 | "output_type": "display_data" 291 | } 292 | ], 293 | "source": [ 294 | "?compose" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 13, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "data": { 304 | 
"text/plain": [ 305 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mmethodcaller\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 306 | "\u001b[1;31mDocstring:\u001b[0m \n", 307 | "methodcaller(name, ...) --> methodcaller object\n", 308 | "\n", 309 | "Return a callable object that calls the given method on its operand.\n", 310 | "After f = methodcaller('name'), the call f(r) returns r.name().\n", 311 | "After g = methodcaller('name', 'date', foo=1), the call g(r) returns\n", 312 | "r.name('date', foo=1).\n", 313 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\operator.py\n", 314 | "\u001b[1;31mType:\u001b[0m type\n", 315 | "\u001b[1;31mSubclasses:\u001b[0m \n" 316 | ] 317 | }, 318 | "metadata": {}, 319 | "output_type": "display_data" 320 | } 321 | ], 322 | "source": [ 323 | "?methodcaller" 324 | ] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "Python 3", 330 | "language": "python", 331 | "name": "python3" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.7.4" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 4 348 | } 349 | -------------------------------------------------------------------------------- /examples/map_reduce.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Map and Reduce style of programming with Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 64, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import dill as pickle\n", 17 | "from random import choice\n", 18 | "from functools import reduce\n", 19 | "from pathos.multiprocessing import ProcessingPool as Pool\n", 20 | "from toolz.sandbox.parallel import fold\n", 21 | "from itertools import starmap" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 61, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "\u001b[1;31mDocstring:\u001b[0m\n", 33 | "reduce(function, sequence[, initial]) -> value\n", 34 | "\n", 35 | "Apply a function of two arguments cumulatively to the items of a sequence,\n", 36 | "from left to right, so as to reduce the sequence to a single value.\n", 37 | "For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates\n", 38 | "((((1+2)+3)+4)+5). 
If initial is present, it is placed before the items\n", 39 | "of the sequence in the calculation, and serves as a default when the\n", 40 | "sequence is empty.\n", 41 | "\u001b[1;31mType:\u001b[0m builtin_function_or_method\n" 42 | ] 43 | }, 44 | "metadata": {}, 45 | "output_type": "display_data" 46 | } 47 | ], 48 | "source": [ 49 | "?reduce" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 24, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "249\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "xs = [10, 5, 1, 19, 11, 203]\n", 67 | "\n", 68 | "def my_add(acc, nxt):\n", 69 | " return acc + nxt\n", 70 | "\n", 71 | "print(reduce(my_add, xs, 0))" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 25, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "249\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "print(reduce(lambda acc, nxt: acc+nxt, xs, 0))" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 62, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mPool\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 100 | "\u001b[1;31mDocstring:\u001b[0m \n", 101 | "Mapper that leverages python's multiprocessing.\n", 102 | " \n", 103 | "\u001b[1;31mInit docstring:\u001b[0m\n", 104 | "Important class members:\n", 105 | " nodes - number (and potentially description) of workers\n", 106 | " ncpus - number of worker processors\n", 107 | " servers - list of worker servers\n", 108 | " scheduler - the associated scheduler\n", 109 | " workdir - associated $WORKDIR for scratch calculations/files\n", 110 | "\n", 111 | "Other class members:\n", 112 | " scatter - True, if uses 'scatter-gather' (instead of 'worker-pool')\n", 113 | " source - False, if minimal use of TemporaryFiles is desired\n", 114 | " timeout - number of seconds to wait for return value from scheduler\n", 115 | " \n", 116 | "NOTE: if number of nodes is not given, will autodetect processors\n", 117 | " \n", 118 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\pathos\\multiprocessing.py\n", 119 | "\u001b[1;31mType:\u001b[0m type\n", 120 | "\u001b[1;31mSubclasses:\u001b[0m \n" 121 | ] 122 | }, 123 | "metadata": {}, 124 | "output_type": "display_data" 125 | } 126 | ], 127 | "source": [ 128 | "?Pool" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 28, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "6.15 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "%%timeit -r 1\n", 146 | "with Pool() as P: \n", 147 | " fold(my_add, range(1000000), map=P.imap)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 29, 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "name": "stdout", 157 | "output_type": "stream", 158 | "text": [ 159 | "129 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 10 loops each)\n" 160 | ] 161 | } 162 | ], 163 | "source": [ 164 | "%%timeit -r 1\n", 165 | "reduce(my_add, range(1000000))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Speeding up map and reduce\n", 173 | "> Using a parallel map can counterintuitively be slower than using a lazy map in map an reduce scenarios\n", 174 | "\n", 175 | "We can always use parallelization at the reduce level instead of at the map level" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 30, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "def map_combination(left, right):\n", 185 | " return left + right\n", 186 | "\n", 187 | "\n", 188 | "def keep_if_even(acc, nxt):\n", 189 | " if nxt % 2 == 0:\n", 190 | " return acc + [nxt]\n", 191 | " else: \n", 192 | " return acc" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 63, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/plain": [ 203 | "\u001b[1;31mSignature:\u001b[0m\n", 204 | "\u001b[0mfold\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m\n", 205 | "\u001b[0m \u001b[0mbinop\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 206 | "\u001b[0m \u001b[0mseq\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 207 | "\u001b[0m \u001b[0mdefault\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'__no__default__'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 208 | "\u001b[0m \u001b[0mmap\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m<\u001b[0m\u001b[1;32mclass\u001b[0m \u001b[1;34m'map'\u001b[0m\u001b[1;33m>\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 209 | "\u001b[0m \u001b[0mchunksize\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m128\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 210 | "\u001b[0m \u001b[0mcombine\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 211 | "\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 212 | "\u001b[1;31mDocstring:\u001b[0m\n", 213 | "Reduce without guarantee of ordered reduction.\n", 214 | "\n", 215 | "inputs:\n", 216 | "\n", 217 | "``binop`` - associative operator. The associative property allows us to\n", 218 | " leverage a parallel map to perform reductions in parallel.\n", 219 | "``seq`` - a sequence to be aggregated\n", 220 | "``default`` - an identity element like 0 for ``add`` or 1 for mul\n", 221 | "\n", 222 | "``map`` - an implementation of ``map``. This may be parallel and\n", 223 | " determines how work is distributed.\n", 224 | "``chunksize`` - Number of elements of ``seq`` that should be handled\n", 225 | " within a single function call\n", 226 | "``combine`` - Binary operator to combine two intermediate results.\n", 227 | " If ``binop`` is of type (total, item) -> total\n", 228 | " then ``combine`` is of type (total, total) -> total\n", 229 | " Defaults to ``binop`` for common case of operators like add\n", 230 | "\n", 231 | "Fold chunks up the collection into blocks of size ``chunksize`` and then\n", 232 | "feeds each of these to calls to ``reduce``. This work is distributed\n", 233 | "with a call to ``map``, gathered back and then refolded to finish the\n", 234 | "computation. In this way ``fold`` specifies only how to chunk up data but\n", 235 | "leaves the distribution of this work to an externally provided ``map``\n", 236 | "function. 
This function can be sequential or rely on multithreading,\n", 237 | "multiprocessing, or even distributed solutions.\n", 238 | "\n", 239 | "If ``map`` intends to serialize functions it should be prepared to accept\n", 240 | "and serialize lambdas. Note that the standard ``pickle`` module fails\n", 241 | "here.\n", 242 | "\n", 243 | "Example\n", 244 | "-------\n", 245 | "\n", 246 | ">>> # Provide a parallel map to accomplish a parallel sum\n", 247 | ">>> from operator import add\n", 248 | ">>> fold(add, [1, 2, 3, 4], chunksize=2, map=map)\n", 249 | "10\n", 250 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\toolz\\sandbox\\parallel.py\n", 251 | "\u001b[1;31mType:\u001b[0m function\n" 252 | ] 253 | }, 254 | "metadata": {}, 255 | "output_type": "display_data" 256 | } 257 | ], 258 | "source": [ 259 | "?fold" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 39, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "name": "stdout", 269 | "output_type": "stream", 270 | "text": [ 271 | "996 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 272 | ] 273 | } 274 | ], 275 | "source": [ 276 | "%%timeit -r 1\n", 277 | "with Pool() as P:\n", 278 | " fold(keep_if_even, range(100000), [],\n", 279 | " map=P.imap, combine=map_combination)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 22, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "4.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 292 | ] 293 | } 294 | ], 295 | "source": [ 296 | "%%timeit -r 1\n", 297 | "reduce(keep_if_even, range(100000), [])" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 56, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "N = 100000\n", 307 | "P = Pool()\n", 308 | "xs = range(N)\n", 309 | "\n", 310 | "\n", 311 | "# Parallel summation\n", 312 | "def my_add(left, right):\n", 313 | " return left+right" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 57, 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "name": "stdout", 323 | "output_type": "stream", 324 | "text": [ 325 | "611 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 326 | ] 327 | } 328 | ], 329 | "source": [ 330 | "%%timeit -r 1\n", 331 | "fold(my_add, xs, map=P.imap)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 58, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "# Parallel filter\n", 341 | "def map_combination(left, right):\n", 342 | " return left + right\n", 343 | "\n", 344 | "def keep_if_even(acc, nxt):\n", 345 | " if nxt % 2 == 0:\n", 346 | " return acc + [nxt]\n", 347 | " else: \n", 348 | " return acc" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 59, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "981 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "%%timeit -r 1\n", 366 | "fold(keep_if_even, xs, [], map=P.imap, combine=map_combination)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 43, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "{1: 16586, 2: 16732, 3: 16727, 4: 16651, 5: 16683, 6: 16621}\n" 379 | ] 380 | } 381 | ], 382 | "source": [ 383 | "#Parallel frequencies\n", 384 | "def combine_counts(left, right):\n", 385 | " unique_keys = set(left.keys()).union(set(right.keys()))\n", 386 | " return {k:left.get(k,0)+right.get(k,0) for k in unique_keys}\n", 387 | "\n", 388 | "def make_counts(acc, nxt):\n", 389 | " acc[nxt] = acc.get(nxt,0) + 1\n", 390 | " return acc\n", 391 | "\n", 392 | "xs = (choice([1,2,3,4,5,6]) for _ in range(N))" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 60, 398 | "metadata": {}, 399 | "outputs": [ 400 | { 401 | "name": "stdout", 402 | "output_type": "stream", 403 | "text": [ 404 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 405 | ] 406 | } 407 | ], 408 | "source": [ 409 | "%%timeit -r 1\n", 410 | "fold(make_counts, xs, {}, map=P.imap, combine=combine_counts)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 65, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mstarmap\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 422 | "\u001b[1;31mDocstring:\u001b[0m \n", 423 | "starmap(function, sequence) --> starmap object\n", 424 | "\n", 425 | "Return an iterator whose values are returned from the function evaluated\n", 426 | "with an argument tuple taken from the given sequence.\n", 427 | "\u001b[1;31mType:\u001b[0m type\n", 428 | "\u001b[1;31mSubclasses:\u001b[0m \n" 429 | ] 430 | }, 431 | "metadata": {}, 432 | "output_type": "display_data" 433 | } 434 | ], 435 | "source": [ 436 | "?starmap" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 67, 442 | "metadata": {}, 443 | "outputs": [ 444 | { 445 | "name": "stdout", 446 | "output_type": "stream", 447 | "text": [ 448 | "[8, 3, 1, 19, 22]\n" 449 | ] 450 | } 451 | ], 452 | "source": [ 453 | "xs = [7, 3, 1, 19, 11]\n", 454 | "ys = [8, 1, -3, 14, 22]\n", 455 | "\n", 456 | "loop_maxes = [max(ys[i], x) for i, x in enumerate(xs)]\n", 457 | "map_maxes = list(starmap(max, zip(xs, ys)))\n", 458 | "\n", 459 | "print(loop_maxes)" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 68, 465 | "metadata": {}, 466 | "outputs": [ 467 | { 468 | "name": "stdout", 469 | "output_type": "stream", 470 | "text": [ 471 | "[8, 3, 1, 19, 22]\n" 472 | ] 473 | } 474 | ], 475 | "source": [ 476 | "print(map_maxes)" 477 | ] 478 | } 479 | ], 480 | "metadata": { 481 | "kernelspec": { 482 | "display_name": "Python 3", 483 | "language": "python", 484 | "name": "python3" 485 | }, 486 | "language_info": { 487 | "codemirror_mode": { 488 | "name": "ipython", 489 | "version": 3 490 | }, 491 | "file_extension": ".py", 492 | "mimetype": "text/x-python", 493 | "name": "python", 494 | "nbconvert_exporter": "python", 495 | "pygments_lexer": "ipython3", 
496 | "version": "3.7.4" 497 | } 498 | }, 499 | "nbformat": 4, 500 | "nbformat_minor": 4 501 | } 502 | -------------------------------------------------------------------------------- /examples/parallel_benchmark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "3.7.7\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "from platform import python_version\n", 18 | "\n", 19 | "print(python_version())" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "'1.18.0'" 31 | ] 32 | }, 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "import fklearn\n", 40 | "\n", 41 | "fklearn.__version__" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "import numpy as np\n", 51 | "\n", 52 | "np.random.seed(42)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "from fklearn.data.datasets import make_tutorial_data\n", 62 | "from fklearn.preprocessing.splitting import space_time_split_dataset\n", 63 | "from concurrent.futures import ThreadPoolExecutor" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "TRAIN_START_DATE = '2015-01-01' \n", 73 | "TRAIN_END_DATE = '2015-03-01' \n", 74 | "HOLDOUT_END_DATE = '2015-04-01'\n", 75 | "\n", 76 | "split_fn = space_time_split_dataset(train_start_date=TRAIN_START_DATE,\n", 77 | " train_end_date=TRAIN_END_DATE,\n", 78 | " holdout_end_date=HOLDOUT_END_DATE,\n", 79 | " space_holdout_percentage=.5,\n", 80 | " split_seed=42, \n", 81 | " space_column=\"id\",\n", 82 | " time_column=\"date\")" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 6, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "def measure_time_split_fn(sample_size):\n", 92 | " result = %timeit -r 1 -o split_fn(data[:sample_size])\n", 93 | " return result" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 7, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "sample_sizes = [10000, 100000, 250000, 500000, 750000, 1000000]\n", 103 | "data = make_tutorial_data(sample_sizes[-1])" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 8, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "7.12 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)\n", 116 | "26.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 117 | "54.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 118 | "102 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 119 | "152 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 120 | "204 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n" 121 | ] 122 | } 123 | ], 124 | "source": [ 125 | "old_function_times = []\n", 126 | "\n", 127 | "for sample_size in sample_sizes:\n", 128 | " time_old = %timeit -r 1 -o split_fn(data[:sample_size])\n", 129 | " old_function_times.append(time_old.best)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 9, 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "[0.007119678000000001,\n", 141 | " 0.026497800000000016,\n", 142 | " 0.05441616999999992,\n", 143 | " 0.10246849999999999,\n", 144 | " 0.15187872999999996,\n", 145 | " 0.2037241999999999]" 146 | ] 147 | }, 148 | "execution_count": 9, 149 | "metadata": {}, 150 | "output_type": "execute_result" 151 | } 152 | ], 153 | "source": [ 154 | "old_function_times" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 10, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "985 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 167 | "1.01 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)939 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 168 | "\n", 169 | "947 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 170 | "876 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 171 | "149 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "with ThreadPoolExecutor() as executor:\n", 177 | " result = executor.map(measure_time_split_fn, sample_sizes)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 11, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/plain": [ 188 | "[,\n", 189 | " ,\n", 190 | " ,\n", 191 | " ,\n", 192 | " ,\n", 193 | " ]" 194 | ] 195 | }, 196 | "execution_count": 11, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "list(result)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 12, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "import numpy as np\n", 212 | "from numpy import nan\n", 213 | "import pandas as pd\n", 214 | "\n", 215 | "\n", 216 | "def new_make_tutorial_data(n: int) -> pd.DataFrame:\n", 217 | " \"\"\"\n", 218 | " Generates fake data for a tutorial. 
There are 3 numerical features (\"num1\", \"num3\" and \"num3\")\n", 219 | " and tow categorical features (\"cat1\" and \"cat2\")\n", 220 | " sex, age and severity, the treatment is a binary variable, medication and the response\n", 221 | " days until recovery.\n", 222 | " Parameters\n", 223 | " ----------\n", 224 | " n : int\n", 225 | " The number of samples to generate\n", 226 | " Returns\n", 227 | " ----------\n", 228 | " df : pd.DataFrame\n", 229 | " A tutorial dataset\n", 230 | " \"\"\"\n", 231 | " np.random.seed(1111)\n", 232 | "\n", 233 | " dataset = pd.DataFrame({\n", 234 | " \"id\": list(map(lambda x: \"id%d\" % x, np.random.randint(0, 1000000, n))),\n", 235 | " \"date\": np.random.choice(pd.date_range(\"2015-01-01\", periods=100), n),\n", 236 | " \"feature1\": np.random.gamma(20, size=n),\n", 237 | " \"feature2\": np.random.normal(40, size=n),\n", 238 | " \"feature3\": np.random.choice([\"a\", \"b\", \"c\"], size=n)})\n", 239 | "\n", 240 | " dataset[\"target\"] = (dataset[\"feature1\"]\n", 241 | " + dataset[\"feature2\"]\n", 242 | " + dataset[\"feature3\"].apply(lambda x: 0 if x == \"a\" else 30 if x == \"b\" else 10)\n", 243 | " + np.random.normal(0, 5, size=n))\n", 244 | "\n", 245 | " # insert some NANs\n", 246 | " dataset.loc[np.random.randint(0, n, 100), \"feature1\"] = nan\n", 247 | " dataset.loc[np.random.randint(0, n, 100), \"feature3\"] = nan\n", 248 | "\n", 249 | " return dataset" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 13, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "# sample_sizes = [10000, 50000, 100000, 200000, 250000,]\n", 259 | "sample_sizes = [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]\n", 260 | "data = new_make_tutorial_data(sample_sizes[-1])" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 14, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 273 | "2.45 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 274 | "2.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 275 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 276 | "2.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 277 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 278 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)2.48 s ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n", 279 | "\n" 280 | ] 281 | } 282 | ], 283 | "source": [ 284 | "with ThreadPoolExecutor() as executor:\n", 285 | " result = executor.map(measure_time_split_fn, sample_sizes)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [] 294 | } 295 | ], 296 | "metadata": { 297 | "kernelspec": { 298 | "display_name": "Python 3", 299 | "language": "python", 300 | "name": "python3" 301 | }, 302 | "language_info": { 303 | "codemirror_mode": { 304 | "name": "ipython", 305 | "version": 3 306 | }, 307 | "file_extension": ".py", 308 | "mimetype": "text/x-python", 309 | "name": "python", 310 | "nbconvert_exporter": "python", 311 | "pygments_lexer": "ipython3", 312 | "version": "3.7.7" 313 | } 314 | }, 315 | "nbformat": 4, 316 | "nbformat_minor": 4 317 | } 318 | -------------------------------------------------------------------------------- /examples/parallel_example.py: -------------------------------------------------------------------------------- 1 | import collections 2 | from concurrent.futures import ThreadPoolExecutor 3 | import os 4 | import time 5 | from pprint import pprint 6 | 7 | 8 | def get_age(scientist): 9 | time.sleep(1) 10 | print(f"Process {os.getpid()} working record {scientist.name}") 11 | name_and_age = { 12 | "name": scientist.name, 13 | "age": int(time.strftime("%Y")) - scientist.born, 14 | } 15 | print(f"Process {os.getpid()} done processing record {scientist.name}") 16 | 17 | return name_and_age 18 | 19 | 20 | if __name__ == "__main__": 21 | Scientist = collections.namedtuple("Scientist", ["name", "field", "born", "nobel",]) 22 | 23 | scientists = ( 24 | Scientist(name="Ada Lovelace", field="math", born=1815, nobel=False), 25 | Scientist(name="Emmy Noether", field="math", born=1882, nobel=False), 26 | Scientist(name="Marie Curie", field="physics", born=1867, nobel=True), 27 | Scientist(name="Tu Youyou", field="chemistry", born=1930, nobel=True), 28 | Scientist(name="Ada Yonath", field="chemistry", born=1939, nobel=True), 29 | Scientist(name="Vera Rubin", field="astronomy", born=1928, nobel=False), 30 | Scientist(name="Sally Ride", field="physics", born=1951, nobel=False), 31 | ) 32 | 33 | print("\nSerial execution") 34 | start = time.time() 35 | result = tuple(map(get_age, scientists)) 36 | end = time.time() 37 | pprint(result) 38 | print(f"\nTime to complete: {end - start:.2f}s\n") 39 | 40 | 41 | print("\nParallel execution") 42 | start = time.time() 43 | with ThreadPoolExecutor() as executor: 44 | result = executor.map(get_age, scientists) 45 | end = time.time() 46 | pprint(tuple(result)) 47 | print(f"\nTime to complete: {end - start:.2f}s\n") 48 | 49 | -------------------------------------------------------------------------------- /high-performance-python/compiler_options.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/high-performance-python/compiler_options.png -------------------------------------------------------------------------------- /high-performance-python/cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/high-performance-python/cover.jpg -------------------------------------------------------------------------------- /high-performance-python/notes.md: 
-------------------------------------------------------------------------------- 1 | # High performance Python: Practical Performant Programming for Humans 2 | 3 | Authors: Micha Gorelick, Ian Ozsvald 4 | 5 | ![cover](cover.jpg) 6 | 7 | - [High performance Python: Practical Performant Programming for Humans](#high-performance-python-practical-performant-programming-for-humans) 8 | - [Ch1. Understanding Performant Python](#ch1-understanding-performant-python) 9 | - [Why use Python?](#why-use-python) 10 | - [How to be a highly performant programmer](#how-to-be-a-highly-performant-programmer) 11 | - [Ch2. Profiling to Find Bottlenecks](#ch2-profiling-to-find-bottlenecks) 12 | - [cProfile module](#cprofile-module) 13 | - [Visualizing cProfile output with Snakeviz](#visualizing-cprofile-output-with-snakeviz) 14 | - [Using line_profiler for line-by-line measurements](#using-line_profiler-for-line-by-line-measurements) 15 | - [Using memory_profiler to diagnose memory usage](#using-memory_profiler-to-diagnose-memory-usage) 16 | - [Introspecting an existing process with PySpy](#introspecting-an-existing-process-with-pyspy) 17 | - [Ch3. Lists and Tuples](#ch3-lists-and-tuples) 18 | - [Ch4. Dictionaries and Sets](#ch4-dictionaries-and-sets) 19 | - [Complexity and speed](#complexity-and-speed) 20 | - [How do dictionaries and sets work?](#how-do-dictionaries-and-sets-work) 21 | - [Ch5. Iterators and Generators](#ch5-iterators-and-generators) 22 | - [Python `for` loop deconstructed](#python-for-loop-deconstructed) 23 | - [Lazy generator evaluation](#lazy-generator-evaluation) 24 | - [Ch6. Matrix and Vector Computation](#ch6-matrix-and-vector-computation) 25 | - [Memory fragmentation](#memory-fragmentation) 26 | - [numpy](#numpy) 27 | - [numexpr: making in-place operations faster and easier](#numexpr-making-in-place-operations-faster-and-easier) 28 | - [Lessons from matrix optimizations](#lessons-from-matrix-optimizations) 29 | - [Pandas](#pandas) 30 | - [Pandas's internal model](#pandass-internal-model) 31 | - [Building DataFrames and Series from partial results rather than concatenating](#building-dataframes-and-series-from-partial-results-rather-than-concatenating) 32 | - [Advice for effective pandas development](#advice-for-effective-pandas-development) 33 | - [Ch7. Compiling to C](#ch7-compiling-to-c) 34 | - [Python offers](#python-offers) 35 | - [What sort of speed gains are possible?](#what-sort-of-speed-gains-are-possible) 36 | - [JIT versus AOT compilers](#jit-versus-aot-compilers) 37 | - [Why does type information help the code run faster?](#why-does-type-information-help-the-code-run-faster) 38 | - [Using a C compiler](#using-a-c-compiler) 39 | - [Cython](#cython) 40 | - [Numba](#numba) 41 | - [PyPy](#pypy) 42 | - [When to use each technology](#when-to-use-each-technology) 43 | - [Other upcoming projects](#other-upcoming-projects) 44 | - [Graphics Processing Units (GPUs)](#graphics-processing-units-gpus) 45 | - [Dynamic graphs: PyTorch](#dynamic-graphs-pytorch) 46 | - [Basic GPU profiling](#basic-gpu-profiling) 47 | - [When to use GPUs](#when-to-use-gpus) 48 | - [Ch8. Asynchronous I/O](#ch8-asynchronous-io) 49 | - [Introduction to asynchronous programming](#introduction-to-asynchronous-programming) 50 | - [How does async/await work?](#how-does-asyncawait-work) 51 | - [Gevent](#gevent) 52 | - [tornado](#tornado) 53 | - [aiohttp](#aiohttp) 54 | - [Batched results](#batched-results) 55 | - [Ch9. 
The multiprocessing module](#ch9-the-multiprocessing-module) 56 | - [Replacing multiprocessing with Joblib](#replacing-multiprocessing-with-joblib) 57 | - [Intelligent caching of function call results](#intelligent-caching-of-function-call-results) 58 | - [Using numpy](#using-numpy) 59 | - [Asynchronous systems](#asynchronous-systems) 60 | - [Interprocess Communication (IPC)](#interprocess-communication-ipc) 61 | - [multiprocessing.Manager()](#multiprocessingmanager) 62 | - [Redis](#redis) 63 | - [mmap](#mmap) 64 | - [Ch10. Clusters and Job Queues](#ch10-clusters-and-job-queues) 65 | - [Benefits of clustering](#benefits-of-clustering) 66 | - [Drawbacks of clustering](#drawbacks-of-clustering) 67 | - [Parallel Pandas with Dask](#parallel-pandas-with-dask) 68 | - [Dask](#dask) 69 | - [Swifter](#swifter) 70 | - [Vaex](#vaex) 71 | - [NSQ for robust production clustering](#nsq-for-robust-production-clustering) 72 | - [Ch11. Using less RAM](#ch11-using-less-ram) 73 | - [Objects for primitives are expensive](#objects-for-primitives-are-expensive) 74 | - [The `array` module stores many primitive objects cheaply](#the-array-module-stores-many-primitive-objects-cheaply) 75 | - [Using less RAM in NumPy with NumExpr](#using-less-ram-in-numpy-with-numexpr) 76 | - [Bytes versus Unicode](#bytes-versus-unicode) 77 | - [More efficient tree structures to represent strings](#more-efficient-tree-structures-to-represent-strings) 78 | - [Directed Acyclic Word Graph (DAWG)](#directed-acyclic-word-graph-dawg) 79 | - [Marisa Trie](#marisa-trie) 80 | - [Scikit-learn's DictVectorizer and FeatureHasher](#scikit-learns-dictvectorizer-and-featurehasher) 81 | - [SciPy's Sparse Matrices](#scipys-sparse-matrices) 82 | - [Tips for using less RAM](#tips-for-using-less-ram) 83 | - [Probabilistic Data Structures](#probabilistic-data-structures) 84 | - [Morris counter](#morris-counter) 85 | - [K-Minimum values](#k-minimum-values) 86 | - [Bloom filters](#bloom-filters) 87 | - [Ch12. Lessons from the field](#ch12-lessons-from-the-field) 88 | 89 | > "Every programmer can benefit from understanding how to build performant systems (...) When something becomes ten times cheaper in time or compute costs, suddenly the set of applications you can address is wider than you imagined" 90 | 91 | Supplemental material for the book (code examples, exercises, etc.) is available for download at https://github.com/mynameisfiber/high_performance_python_2e. 92 | 93 | # Ch1. Understanding Performant Python 94 | 95 | ## Why use Python? 96 | - highly expressive and easy to learn 97 | - `scikit-learn` wraps LIBLINEAR and LIBSVM (written in C) 98 | - `numpy` includes BLAS and other C and Fortran libraries 99 | - Python code that properly utilizes these modules can be as fast as comparable C code 100 | - "batteries included" 101 | - enables fast prototyping of an idea 102 | 103 | ## How to be a highly performant programmer 104 | Overall team velocity is far more important than speedups and complicated solutions. Several factors are key to this: 105 | - Good structure 106 | - Documentation 107 | - Debuggability 108 | - Shared standards 109 | 110 | # Ch2. Profiling to Find Bottlenecks 111 | 112 | Profiling lets you make the most pragmatic decisions for the least overall effort: code that runs "fast enough" and "lean enough" 113 | 114 | > "If you avoid profiling and jump to optimization, you'll quite likely do more work in the long run.
Always be driven by the results of profiling" 115 | 116 | *"Embarrassingly parallel problem"*: no data is shared between points 117 | 118 | `timeit` module temporarily disables the garbage collector 119 | 120 | ## cProfile module 121 | Built-in profiling tool in the standard library 122 | 123 | - `profile`: original and slower pure Python profiler 124 | - `cProfile`: same interface as `profile` and is written in `C` for a lower overhead 125 | 126 | 1. Generate a *hypothesis* about the speed of parts of your code 127 | 2. Measure how wrong you are 128 | 3. Improve your intuition about certain coding styles 129 | 130 | ### Visualizing cProfile output with Snakeviz 131 | `snakeviz`: visualizer that draws the output of `cProfile` as a diagram -> larger boxes are areas of code that take longer to run 132 | 133 | ## Using line_profiler for line-by-line measurements 134 | `line_profiler`: strongest tool for identifying the cause of CPU-bound problems in Python code: profiles individual functions on a line-by-line basis 135 | 136 | Be aware of the complexity of **Python's dynamic machinery** 137 | 138 | The order of evaluation for Python statements is both **left to right and opportunistic**: put the cheapest test on the left side of the equation 139 | 140 | ## Using memory_profiler to diagnose memory usage 141 | `memory_profiler` measures memory usage on a line-by-line basis: 142 | - Could we use less RAM by rewriting this function to work more efficiently? 143 | - Could we use more RAM and save CPU cycles by caching? 144 | 145 | **Tips** 146 | - Memory profiling makes your code run 10-100x slower 147 | - Install `psutil` so `memory_profiler` runs faster 148 | - Use `memory_profiler` occasionally and `line_profiler` more frequently 149 | - `--pdb-mmem=XXX` flag: the `pdb` debugger is activated after the process exceeds XXX MB -> drops you in directly at the point in your code where too many allocations are occurring 150 | 151 | ## Introspecting an existing process with PySpy 152 | `py-spy`: sampling profiler that doesn't require any code changes -> it introspects an already-running Python process and reports in the console with a *top-like* display 153 |
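The profiling workflow above can be tied together with a small, self-contained sketch (the function and numbers below are hypothetical, used only for illustration): profile a CPU-bound toy function with `cProfile`, then rank the most expensive calls with `pstats`; saving the stats with `python -m cProfile -o profile.stats script.py` and running `snakeviz profile.stats` gives the diagram view mentioned above.

```python
# Illustrative sketch only -- slow_sum is a made-up CPU-bound function
import cProfile
import pstats


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(2_000_000)
    profiler.disable()

    # Sort by cumulative time and print the five most expensive calls
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(5)
```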
154 | # Ch3. Lists and Tuples 155 | - **Lists**: dynamic arrays; mutable and allow for resizing 156 | - **Tuples**: static arrays; immutable and the data within them cannot be changed after they have been created 157 | - Tuples are cached by the Python runtime, which means that we don't need to talk to the kernel to reserve memory every time we want to use one 158 | 159 | Python lists have a built-in sorting algorithm that uses Timsort -> O(n) in the best case and O(n log n) in the worst case 160 | 161 | Once sorted, we can find our desired element using a binary search -> average complexity of O(log n) 162 | 163 | Dictionary lookup takes only O(1), but: 164 | - converting the data to a dictionary takes O(n) 165 | - no repeating keys may be undesirable 166 | 167 | `bisect` module: provides alternative functions, heavily optimized 168 | 169 | > "**Pick the right data structure and stick with it!** Although there may be more efficient data structures for particular operations, the cost of converting to those data structures may negate any efficiency boost" 170 | 171 | - Tuples are for describing multiple properties of one unchanging thing 172 | - Lists can be used to store collections of data about completely disparate objects 173 | - Both can take mixed types 174 | 175 | > "Generic code will be much slower than code specifically designed to solve a particular problem" 176 | 177 | - Tuple (immutable): lightweight data structure 178 | - List (mutable): extra memory needed to store them and extra computations needed when using them 179 | 180 | # Ch4. Dictionaries and Sets 181 | Ideal data structures to use when your data has no intrinsic order (except for insertion order), but does have a unique object that can be used to reference it 182 | - *key*: reference object 183 | - *value*: data 184 | 185 | Sets do not actually contain values: a set is a collection of unique keys -> useful for doing set operations 186 | 187 | **hashable** type: implements `__hash__` and either `__eq__` or `__cmp__` 188 | 189 | ## Complexity and speed 190 | - O(1) lookups based on the arbitrary index 191 | - O(1) insertion time 192 | - Larger footprint in memory 193 | - Actual speed depends on the hashing function 194 | 195 | ## How do dictionaries and sets work? 196 | Use *hash tables* to achieve O(1) lookups and insertions -> clever usage of a hash function to turn an arbitrary key (i.e., a string or object) into an index for a list 197 | 198 | > *load factor*: how well distributed the data is throughout the hash table -> related to the entropy of the hash function 199 | 200 | Hash functions must return integers 201 | 202 | - Numerical types (`int` and `float`): hash is based on the bit value of the number they represent 203 | - Tuples and strings: hash value based on their contents 204 | - Lists: do not support hashing because their values can change 205 | 206 | > A custom-selected hash function should be careful to evenly distribute hash values in order to avoid collisions (which will degrade the performance of a hash table) -> constantly "probe" the other values -> worst case O(n) = searching through a list 207 | 208 | **Entropy**: "how well distributed my hash function is" -> max entropy = *ideal* hash function = minimal number of collisions 209 | 210 | # Ch5.
Iterators and Generators 211 | 212 | ## Python `for` loop deconstructed 213 | ```python 214 | # The Python loop 215 | for i in object: 216 | do_work(i) 217 | 218 | # Is equivalent to 219 | object_iterator = iter(object) 220 | while True: 221 | try: 222 | i = next(object_iterator) 223 | except StopIteration: 224 | break 225 | else: 226 | do_work(i) 227 | ``` 228 | 229 | - Changing to generators instead of precomputed arrays may require algorithmic changes (sometimes not so easy to understand) 230 | 231 | > "Many of Python’s built-in functions that operate on sequences are generators themselves. `range` returns a generator of values as opposed to the actual list of numbers within the specified range. Similarly, `map`, `zip`, `filter`, `reversed`, and `enumerate` all perform the calculation as needed and don’t store the full result" 232 | 233 | - Generators have less memory impact than list comprehension 234 | - Generators are really a way of organizing your code and having smarter loops 235 | 236 | ## Lazy generator evaluation 237 | *Single pass* or *online* algorithms: at any point in our calculation with a generator, we have only the current value and cannot reference any other items in the sequence 238 | 239 | `itertools` from the standard library provides useful functions to make generators easier to use: 240 | - `islice`: slicing a potentially infinite generator 241 | - `chain`: chain together multiple generators 242 | - `takewhile`: adds a condition that will end a generator 243 | - `cycle`: makes a finite generator infinite by constantly repeating it 244 | 245 | # Ch6. Matrix and Vector Computation 246 | > Understanding the motivation behind your code and the intricacies of the algorithm will give you deeper insight about possible methods of optimization 247 | 248 | ## Memory fragmentation 249 | Python doesn't natively support vectorization 250 | - Python lists store pointers to the actual data -> good because it allows us to store whatever type of data inside a list, however when it comes to vector and matrix operations, this is a source of performance degradation 251 | - Python bytecode is not optimized for vectorization -> `for` loops cannot predict when using vectorization would be benefical 252 | 253 | *von Neumann bottleneck*: limited bandwidth between memory and CPU as a result of the tiered memory architecture that modern computers use 254 | 255 | `perf` Linux tool: insights into how the CPU is dealing with the program being run 256 | 257 | `array` object is less suitable for math and more suitable for storing fixed-type data more efficiently in memory 258 | 259 | ## numpy 260 | `numpy` has all of the features we need—it stores data in contiguous chunks of memory and supports vectorized operations on its data. As a result, any arithmetic we do on `numpy` arrays happens in chunks without us having to explicitly loop over each element. Not only is it much easier to do matrix arithmetic this way, but it is also faster 261 | 262 | Vectorization from `numpy`: may run fewer instructions per cycle, but each instruction does much more work 263 | 264 | ## numexpr: making in-place operations faster and easier 265 | - `numpy`'s optimization of vector operations: occurs on only one operation at a time 266 | - `numexpr` is a module that can take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space used. 
Expressions can utilize multiple CPU cores 267 | - Easy to change code to use `numexpr`: rewrite the expressions as strings with references to local variables 268 | 269 | ## Lessons from matrix optimizations 270 | Always take care of any administrative things the code must do during initialization 271 | - allocating memory 272 | - reading a configuration from a file 273 | - precomputing values that will be needed throughout the lifetime of a program 274 | 275 | ## Pandas 276 | ### Pandas's internal model 277 | - Operations on columns often generate temporary intermediate arrays which consume RAM: expect a temporary memory usage of up to 3-5x your current usage 278 | - Operations can be single-threaded and limited by Python's global interpreter lock (GIL) 279 | - Columns of the same `dtype` are grouped together by a `BlockManager` -> make row-wise operations on columns of the same datatype faster 280 | - Operations on data of a single common block -> *view*; different `dtypes` -> can cause a *copy* (slower) 281 | - Pandas uses a mix of NumPy datatypes and its own extension datatypes 282 | - numpy `int64` isn't NaN aware -> Pandas `Int64` uses two columns of data: integers and NaN bit mask 283 | - numpy `bool` isn't NaN aware -> Pandas `boolean` 284 | 285 | > More safety makes things run slower (checking passing appropriate data) -> **Developer time (and sanity) x Execution time**. Checks enabled: avoid painful debugging sessions, which kill developer productivity. If we know that our data is of the correct form for our chosen algorithm, these checks will add a penalty 286 | 287 | ## Building DataFrames and Series from partial results rather than concatenating 288 | - Avoid repeated calls to `concat` in Pandas (and to the equivalent `concatenate` in NumPy) 289 | - Build lists of intermediate results and then construct a Series or DataFrame from this list, rather than concatenating to an existing object 290 | 291 | ## Advice for effective pandas development 292 | - Install the optional dependencies `numexpr` and `bottleneck` for additional performance improvements 293 | - Caution against chaining too many rows of pandas operations in sequence: difficult to debug, chain only a couple of operations together to simplify your maintenance 294 | - **Filter your data before calculating** on the remaining rows rather than filtering after calculating 295 | - Check the schema of your DataFrames as they evolve -> tool like `bulwark`, you can visualize confirm that your expectations are being met 296 | - Large Series with low cardinality: `df['series_of_strings'].astype('category')` -> `value_counts` and `groupby` run faster and the Series consume less RAM 297 | - Convert 8-byte `float64` and `int64` to smaller datatypes -> 2-byte `float16` or 1-byte `int8` -> smaller range to further save RAM 298 | - Use the `del` keyword to delete earlier references and clear them from memory 299 | - Pandas `drop` method to delete unused columns 300 | - Persist the prepared DataFrame version to disk by using `to_pickle` 301 | - Avoid `inplace=True` -> are scheduled to be removed from the library over time 302 | - `Modin`, `cuDF` 303 | - `Vaex`: work on very large datasets that exceed RAM by using lazy evaluation while retaining a similar interface to Pandas -> large datasets and string-heavy operations 304 | 305 | # Ch7. 
Compiling to C 306 | To make code run faster: 307 | - Make it do less work 308 | - Choose good algorithms 309 | - Reduce the amount of data you're processing 310 | - Execute fewer instructions -> compile your code down to machine code 311 | 312 | ## Python offers 313 | - `Cython`: pure C-based compiling 314 | - `Numba`: LLVM-based compiling 315 | - `PyPy`: replacement virtual machine which includes a built-in just-in-time (JIT) compiler 316 | 317 | ## What sort of speed gains are possible? 318 | Compiling generate more gains when the code: 319 | - is mathematical 320 | - has lots of loops that repeat the same operations many times 321 | 322 | Unlikely to show speed up: 323 | - calls to external libraries (regexp, string operations, calls to database) 324 | - programs that are I/O-bound 325 | 326 | ## JIT versus AOT compilers 327 | - **AOT (ahead of time)**: `Cython` -> you'll have a library that can instantly be used -> best speedups, but requires the most manual effort 328 | - **JIT (just in time)**: `Numba`, `PyPy` -> you don't have to do much work up front, but you have a "cold start" problem -> impressive speedups with little manual intervention 329 | 330 | ## Why does type information help the code run faster? 331 | Python is dynamically typed -> keeping the code generic makes it run more slowly 332 | 333 | > “Inside a section of code that is CPU-bound, it is often the case that the types of variables do not change. This gives us an opportunity for **static compilation and faster code execution**” 334 | 335 | ## Using a C compiler 336 | `Cython` uses `gcc`: good choice for most platforms; well supported and quite advanced 337 | 338 | ## Cython 339 | - Compiler that converts type-annotaded (C-like) Python into a compiled extension module 340 | - Wide used and mature 341 | - `OpenMP` support: possible to convert parallel problems into multiprocessing-aware modules 342 | - `pyximport`: simplified build system 343 | - Annotation option that output an HTML file -> more yellow = more calls into the Python virtual machine; more white = more non-Python C code 344 | 345 | Lines that cost the most CPU time: 346 | - inside tight inner loops 347 | - dereferencing `list`, `array` or `np.array` items 348 | - mathematical operations 349 | 350 | `cdef` keyword: declare variables inside the function body. These must be declared at the top of the function, as that’s a requirement from the C language specification 351 | 352 | > **Strength reduction**: writing equivalent but more specialized code to solve the same problem. Trade worse flexibility (and possibly worse readability) for faster execution 353 | 354 | `memoryview`: allows the same low-level access to any object that implements the buffer interface, including `numpy` arrays and Python arrays 355 | 356 | ## Numba 357 | - JIT compiler that specializes in `numpy` code, which it compiles via LLVM compiler at runtime 358 | - You provide a decorator telling it which functions to focus on and then you let Numba take over 359 | - `numpy` arrays and nonvectorized code that iterates over many items: Numba should give you a quick and very painless win. 360 | - Numba does not bind to external C libraries (which Cython can do), but it can automatically generate code for GPUs (which Cython cannot). 
361 | - OpenMP parallelization support with `prange` 362 | - Break your code into small (<10 line) and discrete functions and tackle these one at a time 363 | 364 | ``` 365 | from numba import jit 366 | 367 | @jit() 368 | def my_fn(): 369 | ``` 370 | 371 | ## PyPy 372 | - Alternative implementation of the Python language that includes a tracing just-in-time compiler 373 | - Offers a faster experience than CPython 374 | - Uses a different type of garbage collector (modified mark-and-sweep) than CPython (reference counting) = may clean up an unused object much later 375 | - PyPy can use a lot of RAM 376 | - `vmprof`: lightweight sampling profiler 377 | 378 | ## When to use each technology 379 | ![compiler_options](./compiler_options.png) 380 | 381 | - `Numba`: quick wins for little effort; young project 382 | - `Cython`: best results for the widest set of prolbmes; requires more effort; mix Python and C annotations 383 | - `PyPy`: strong option if you're not using `numpy` or other hard-to-port C extensions 384 | 385 | ### Other upcoming projects 386 | - Pythran 387 | - Transonic 388 | - ShedSkin 389 | - PyCUDA 390 | - PyOpenCL 391 | - Nuitka 392 | 393 | ## Graphics Processing Units (GPUs) 394 | Easy-to-use GPU mathematics libraries: 395 | - TensorFlow 396 | - PyTorch 397 | 398 | ### Dynamic graphs: PyTorch 399 | Static computational graph tensor library that is particularly user-friendly and has a very intuitive API for anyone familiar with `numpy` 400 | 401 | > *Static computational graph*: performing operations on `PyTorch` objects creates a dynamic definition of a program that gets compiled to GPU code in the background when it is executed -> changes to the Python code automatically get reflected in changes in the GPU code without an explicit compilation step needed 402 | 403 | ### Basic GPU profiling 404 | - `nvidia-smi`: inspect the resource utilization of the GPU 405 | - Power usage is a good proxy for judging how much of the GPU's compute power is being used -> more power the GPU is drawing = more compute it is currently doing 406 | 407 | ### When to use GPUs 408 | - Task requires mainly linear algebra and matrix manipulations (multiplication, addition, Fourier transforms) 409 | - Particularly true if the calculation can happen on the GPU uninterrupted for a period of time before being copied back into system memory 410 | - GPU can run many more tasks at once than the CPU can, but each of those tasks run more slowly on the GPU than on the CPU 411 | - Not a good tool for tasks that require exceedingly large amounts of data, many conditional manipulations of the data, or changing data 412 | 413 | 1. Ensure that the memory use of the problem will fit withing the GPU 414 | 2. Evaluate whether the algorithm requires a lot of branching conditions versus vectorized operations 415 | 3. Evaluate how much data needs to be moved between the GPU and the CPU 416 | 417 | # Ch8. 
Asynchronous I/O 418 | *I/O bound program*: the speed is bounded by the efficiency of the input/output 419 | 420 | Asynchronous I/O helps utilize the wasted *I/O wait* time by allowing us to perform other operations while we are in that state 421 | 422 | ## Introduction to asynchronous programming 423 | - *Context switch*: when a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request 424 | - **Callback paradigm**: functions are called with an argument that is generally called the callback -> instead of the function returing its value, it call the callback function with the value instead -> long chains = "callback hell" 425 | - **Future paradigm**: an asynchronous function returns a `Future` object, which is a promise of a future result 426 | - `asyncio` standard library module and PEP 492 made the future's mechanism native to Python 427 | 428 | ## How does async/await work? 429 | - `async` function (defined with `async def`) is called a *coroutine* 430 | - Coroutines are implemented with the same philosophies as generators 431 | - `await` is similar in function to a `yield` -> the execution of the current function gets paused while other code is run 432 | 433 | ### Gevent 434 | - Patches the standard library with asynchronous I/O functions, 435 | - Has a `Greenlets` object that can be used for concurrent execution 436 | - Ideal solution for mainly CPU-based problems that sometimes involve heavy I/O 437 | 438 | ### tornado 439 | - Frequently used package for asynchronous I/O in Python 440 | - Originally developed by Facebook primarily for HTTP clients and servers 441 | - Ideal for any application that is mostly I/O-bound and where most of the application should be asynchronous 442 | - Performant web server 443 | 444 | ### aiohttp 445 | - Built entirely on the `asyncio` library 446 | - Provides both HTTP client and server functionality 447 | - Uses a similar API to `tornado` 448 | 449 | ### Batched results 450 | - *Pipelining*: batching results -> can help lower the burden of an I/O task 451 | - Good compromise between the speeds of asynchronous I/O and the ease of writing serial programs 452 | 453 | # Ch9. The multiprocessing module 454 | - Additional process = more communication overhead = decrease available RAM -> rarely get a full *n*-times speedup 455 | - If you run out of RAM and the system reverts to using the disk’s swap space, any parallelization advantage will be massively lost to the slow paging of RAM back and forth to disk 456 | - Using hyperthreads: CPython uses a lot of RAM -> hyperthreading is not cache friendly. 
Hyperthreads = added bonus and not a resource to be optimized against -> adding more CPUs is more economical than tuning your code 457 | - **Amdahl's law**: if only a small part of your code can be parallelized, it doesn't matter how many CPUs you throw at it; it still won't run much faster overall 458 | - `multiprocessing` module: process and thread-based parallel processing, share work over queues, and share data among processes -> focus: single-machine multicore parallelism 459 | - `multiprocessing`: higher level, sharing Python data structures 460 | - `OpenMP`: works with C primitive objects once you've compiled to C 461 | 462 | > Keep the parallelism as simple as possible so that your development velocity is kept high 463 | 464 | - *Embarrassingly parallel*: multiple Python processes all solving the same problem without communicating with one another -> not much penalty will be incurred as we add more and more Python processes 465 | 466 | Typical jobs for the `multiprocessing` module: 467 | - Parallelize a CPU-bound task with `Process` or `Pool` objects 468 | - Parallelize an I/O-bound task in a `Pool` with threads using the `dummy` module 469 | - Share pickled work via a `Queue` 470 | - Share state between parallelized workers, including bytes, primitive datatypes, dictionaries, and lists 471 | 472 | > `Joblib`: stronger cross-platform support than `multiprocessing` 473 | 474 | ## Replacing multiprocessing with Joblib 475 | - `Joblib` is an improvement on `multiprocessing` 476 | - Enables lightweight pipelining with a focus on: 477 | - easy parallel computing 478 | - transparent disk-based caching of results 479 | - It focuses on NumPy arrays for scientific computing 480 | - Quick wins: 481 | - process a loop that could be embarrassingly parallel 482 | - expensive functions that have no side effect 483 | - able to share `numpy` data between processes 484 | - `Parallel` class: sets up the process pool 485 | - `delayed` decorator: wraps our target function so it can be applied to the instantiated `Parallel` object via an iterator 486 | 487 | ### Intelligent caching of function call results 488 | `Memory` cache: decorator that caches functions results based on the input arguments to a disk cache 489 | 490 | ### Using numpy 491 | - `numpy` is more cache friendly 492 | - `numpy` can achieve some level of additional speedup around threads by working outside the GIL 493 | 494 | ## Asynchronous systems 495 | Require a special level of patience. Suggestions: 496 | - K.I.S.S. 497 | - Avoiding asynchronous self-contained systems if possible, as they will grow in complexity and quickly become hard to maintain 498 | - Using mature libraries like `gevent` that give you tried-and-tested approaches to dealing with certain problem sets 499 | 500 | ## Interprocess Communication (IPC) 501 | - Cooperation cost can be high: synchronizing data and checking the shared data 502 | - Sharing state tends to make things complicated 503 | - IPC is fairly easy but generally comes with a cost 504 | 505 | ### multiprocessing.Manager() 506 | - Lets us share higher-level Python objects between processes as managed shared objects; the lower-level objects are wrapped in proxy objects 507 | - The wrapping and safety have a speed cost but also offer great flexibility. 508 | - You can share both lower-level objects (e.g., integers and floats) and lists and dictionaries. 509 | 510 | ### Redis 511 | - **Key/value in-memory storage engine**. 
It provides its own locking and each operation is atomic, so we don’t have to worry about using locks from inside Python (or from any other interfacing language). 512 | - Lets you share state not just with other Python processes but also other tools and other machines, and even to expose that state over a web-browser interface 513 | - Redis lets you store: Lists of strings; Sets of strings; Sorted sets of strings; Hashes of strings 514 | - Stores everything in RAM and snapshots to disk 515 | - Supports master/slave replication to a cluster of instances 516 | - Widely used in industry and is mature and well trusted 517 | 518 | ## mmap 519 | - Memory-mapped (shared memory) solution 520 | - The bytes in a shared memory block are not synchronized and they come with very little overhead 521 | - Bytes act like a file -> block of memory with a file-like interface 522 | 523 | # Ch10. Clusters and Job Queues 524 | *Cluster*: collection of computers working together to solve a common task 525 | 526 | Before moving to a clustered solution: 527 | - Profile your system to understand the bottlenecks 528 | - Exploit compile solutions (Numba, Cython) 529 | - Exploit multiple cores on a single machine (Joblib, multiprocessing) 530 | - Exploit techniques for using less RAM 531 | - Really need a lot of CPUs, high resiliency, rapid speed of response, ability to process data from disks in parallel 532 | 533 | ## Benefits of clustering 534 | - Easily scale computing requirements 535 | - Improve reliability 536 | - Dynamic scaling 537 | 538 | ## Drawbacks of clustering 539 | - Change in thinking 540 | - Latency between machines 541 | - Sysadmin problems: software versions between machines, are other machines working? 542 | - Moving parts that need to be in sync 543 | - "If you don't have a documented restart plan, you should assume you'll have to write one at the worst possible time" 544 | 545 | > Using a cloud-based cluster can mitigate a lot of these problems, and some cloud providers also offer a spot-priced market for cheap but temporary computing resources. 546 | 547 | - A system that's easy to debug *probably* beats having a faster system 548 | - Engineering time and the cost of downtime are *probably* your largest expenses 549 | 550 | ## Parallel Pandas with Dask 551 | - Provide a suite of parallelization solutions that scales from a single core on a laptop to multicore machines to thousands of cores in a cluster. 
552 | - "Apache Spark lite” 553 | - For `Pandas` users: larger-than-RAM datasets and desire for multicore parallelization 554 | 555 | ### Dask 556 | - *Bag*: enables parallelized computation on unstructured and semistructured data 557 | - *Array*: enables distributed and larger-than-RAM `numpy` operations 558 | - *Distributed DataFrame*: enables distributed and larger-than-RAM `Pandas` operations 559 | - *Delayed*: parallelize chains of arbitrary Python functions in a lazy fashion 560 | - *Futures*: interface that includes `Queue` and `Lock` to support task collaboration 561 | - *Dask-ML*: scikit-learn-like interface for scalable machine learning 562 | 563 | > You can use Dask (and Swifter) to parallelize any side-effect-free function that you’d usually use in an `apply` call 564 | 565 | - `npartitions` = # cores 566 | 567 | #### Swifter 568 | Builds on Dask to provide three parallelized options with very simple calls: `apply`, `resample` and `rolling` 569 | 570 | ### Vaex 571 | - String-heavy DataFrames 572 | - Larger-than-RAM datasets 573 | - Subsets of a DataFrame -> Implicit lazy evaluation 574 | 575 | ## NSQ for robust production clustering 576 | - Highly performant distributed messaging platform 577 | - *Queues*: type of buffer for messages 578 | - *Pub/subs*: describes who gets what messages (*publisher/subscriber*) 579 | 580 | # Ch11. Using less RAM 581 | - Counting the amount of RAM used by Python object is tricky -> if we ask the OS for a count of bytes used, it will tell us the total amount allocated to the process 582 | - Each unique object has a memory cost 583 | 584 | ## Objects for primitives are expensive 585 | `memory_profiler` 586 | 587 | ``` 588 | %load_ext memory_profiler 589 | 590 | %memit 591 | ``` 592 | 593 | ### The `array` module stores many primitive objects cheaply 594 | - Creates a contiguos block of RAM to hold the underlying data. Which data structures: 595 | - integers, floats and characters 596 | - *not* complex numbers or classes 597 | - Good to pass the array to an external process or use only some of the data (not to compute on them) 598 | - Using a regular `list` to store many numbers is much less efficient in RAM than using an `array` object 599 | - `numpy` arrays are almost certainly a better choice if you are doing anything heavily numeric: 600 | - more datatype options 601 | - many specialized and fast functions 602 | 603 | ### Using less RAM in NumPy with NumExpr 604 | `NumExpr` is a tool that both speeds up and reduces the size of intermediate operations 605 | 606 | > Install the optional NumExpr when using Pandas (Pandas does not tell you if you haven’t installed NumExpr) -> calls to `eval` will run more quickly -> import numpexpr: if this fails, install it! 
609 | 
610 | ## Bytes versus Unicode
611 | - In Python 3.x, all strings are Unicode by default, and if you want to deal in bytes, you'll explicitly create a `bytes` sequence
612 | - **UTF-8 encoding** of a Unicode object uses 1 byte per ASCII character and more bytes for less frequently seen characters
613 | 
614 | ## More efficient tree structures to represent strings
615 | - **Tries**: share common prefixes
616 | - **DAWG**: share common prefixes and suffixes
617 | - Overlapping sequences in your strings -> you'll likely see a RAM improvement
618 | - Save RAM and time in exchange for a little additional effort in preparation
619 | - Unfamiliar data structures to many developers -> isolate in a module to simplify maintenance
620 | 
621 | ### Directed Acyclic Word Graph (DAWG)
622 | Attempts to efficiently represent strings that share common prefixes and suffixes
623 | 
624 | ### Marisa Trie
625 | *Static trie* using Cython bindings to an external library -> it cannot be modified after construction
626 | 
627 | ## Scikit-learn's DictVectorizer and FeatureHasher
628 | - `DictVectorizer`: takes a dictionary of terms and their frequencies and converts them into a variable-width sparse matrix -> it is possible to revert the process
629 | - `FeatureHasher`: converts the same dictionary of terms and frequencies into a fixed-width sparse matrix -> it doesn't store a vocabulary and instead employs a hashing algorithm to assign token frequencies to columns -> can't convert it back to the original token from hash
630 | 
631 | ## SciPy's Sparse Matrices
632 | - Matrix in which most matrix elements are 0
633 | - `COO` matrices: simplest implementation: for each non-zero element we store the value in addition to the location of the value -> each non-zero value = 3 numbers stored -> used only to construct sparse matrices and not for actual computation
634 | - `CSR/CSC` is preferred for computation
635 | 
636 | > Push and pull of speedups with sparse arrays: balance between losing the use of efficient caching and vectorization versus not having to do a lot of the calculations associated with the zero values of the matrix
637 | 
638 | Limitations:
639 | - Low amount of support
640 | - Multiple implementations with benefits and drawbacks
641 | - May require expert knowledge
642 | 
643 | ## Tips for using less RAM
644 | > "If you can avoid putting it into RAM, do.
Everything you load costs you RAM" 645 | 646 | - Numeric data: switch to using `numpy` arrays 647 | - Very sparse arrays: SciPy's sparse array functionality 648 | - Strings: stick to `str` rather than `bytes` 649 | - Many Unicode objects in a static structure: DAWG and trie structures 650 | - Lots of bit strings: `numpy` and the `bitarray` package 651 | 652 | ## Probabilistic Data Structures 653 | - Make trade-offs in accuracy for immense decrease in memory usage 654 | - The number of operations you can do on them is much more restricted 655 | 656 | > “Probabilistic data structures are fantastic when you have taken the time to understand the problem and need to put something into production that can answer a very small set of questions about a very large set of data” 657 | 658 | - "lossy compression": find an alternative representation for the data that is more compact and contains the relevant information for answering a certain set of questions 659 | 660 | ### Morris counter 661 | Keeps track of an exponent and models the counted state as `2^exponent` -> provides an *order of magnitude* estimate 662 | 663 | ### K-Minimum values 664 | If we keep the `k` smallest unique hash values we have seen, we can **approximate the overall spacing between hash values** and infer the total number of items 665 | - *idempotence*: if we do the same operation, with the same inputs, on the structure multiple times, the state will not be changed 666 | 667 | ### Bloom filters 668 | - Answer the question of **whether we've seen an item before** 669 | - Work by having multiple hash values in order to represent a value as multiple integers. If we later see something with the same set of integers, we can be reasonably confident that it is the same value 670 | - **No false negatives and a controllable rate of false positives** 671 | - Set to have error rates below 0.5% 672 | 673 | # Ch12. Lessons from the field 674 | -------------------------------------------------------------------------------- /learning-python-design-patterns/cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/learning-python-design-patterns/cover.jpg -------------------------------------------------------------------------------- /learning-python-design-patterns/notes.md: -------------------------------------------------------------------------------- 1 | # Learning Python Design Patterns 2 | 3 | Author: Chetan Giridhar 4 | 5 | [Available here](https://www.amazon.com/Learning-Python-Design-Patterns-Second-ebook/dp/B018XYKNOM) 6 | 7 | ![learning-python-design-patterns](cover.jpg) 8 | 9 | - [Learning Python Design Patterns](#learning-python-design-patterns) 10 | - [Ch1. 
Introduction to design patterns](#ch1-introduction-to-design-patterns) 11 | - [Understanding object-oriented programming](#understanding-object-oriented-programming) 12 | - [Classes](#classes) 13 | - [Methods](#methods) 14 | - [Major aspects of OOP](#major-aspects-of-oop) 15 | - [Encapsulation](#encapsulation) 16 | - [Polymorphism](#polymorphism) 17 | - [Inheritance](#inheritance) 18 | - [Abstraction](#abstraction) 19 | - [Composition](#composition) 20 | - [Object-oriented design principles](#object-oriented-design-principles) 21 | - [The open/close principle](#the-openclose-principle) 22 | - [The inversion of control principle](#the-inversion-of-control-principle) 23 | - [The interface segregation principle](#the-interface-segregation-principle) 24 | - [The single responsibility principle](#the-single-responsibility-principle) 25 | - [The substitution principle](#the-substitution-principle) 26 | - [The concept of design patterns](#the-concept-of-design-patterns) 27 | - [Advantages of design patterns](#advantages-of-design-patterns) 28 | - [Patterns for dynamic languages](#patterns-for-dynamic-languages) 29 | - [Classifying patterns](#classifying-patterns) 30 | - [Creational patterns](#creational-patterns) 31 | - [Structural patterns](#structural-patterns) 32 | - [Behavioral patterns](#behavioral-patterns) 33 | - [Ch2. The singleton design pattern](#ch2-the-singleton-design-pattern) 34 | - [Monostate Singleton pattern](#monostate-singleton-pattern) 35 | - [Singletons and metaclasses](#singletons-and-metaclasses) 36 | - [Drawbacks](#drawbacks) 37 | - [Ch3. The factory pattern - building factories to create objects](#ch3-the-factory-pattern---building-factories-to-create-objects) 38 | - [Understanding the factory pattern](#understanding-the-factory-pattern) 39 | - [The simple factory pattern](#the-simple-factory-pattern) 40 | - [The factory method pattern](#the-factory-method-pattern) 41 | - [The abstract factory pattern](#the-abstract-factory-pattern) 42 | - [Factory method versus abstract factory method](#factory-method-versus-abstract-factory-method) 43 | - [Ch4. The façade pattern - being adaptive with façade](#ch4-the-façade-pattern---being-adaptive-with-façade) 44 | - [Understanding Structural design patterns](#understanding-structural-design-patterns) 45 | - [Understanding the Façade design pattern](#understanding-the-façade-design-pattern) 46 | - [Main participants](#main-participants) 47 | - [The principle of least knowledge](#the-principle-of-least-knowledge) 48 | - [The Law of Demeter](#the-law-of-demeter) 49 | - [Ch5. The proxy pattern - controlling object access](#ch5-the-proxy-pattern---controlling-object-access) 50 | - [Data Structure components](#data-structure-components) 51 | - [Different types of proxies](#different-types-of-proxies) 52 | - [Decorator vs Proxy](#decorator-vs-proxy) 53 | - [Disadvantages](#disadvantages) 54 | - [Ch6. The observer pattern - keeping objects in the know](#ch6-the-observer-pattern---keeping-objects-in-the-know) 55 | - [Behavioral patterns](#behavioral-patterns-1) 56 | - [Understanding the observer design pattern](#understanding-the-observer-design-pattern) 57 | - [The pull model](#the-pull-model) 58 | - [The push model](#the-push-model) 59 | - [Loose coupling and the observer pattern](#loose-coupling-and-the-observer-pattern) 60 | - [Ch7. 
The command pattern - encapsulating invocation](#ch7-the-command-pattern---encapsulating-invocation) 61 | - [Understanding the command design pattern](#understanding-the-command-design-pattern) 62 | - [Intentions](#intentions) 63 | - [Scenarios of use](#scenarios-of-use) 64 | - [Advantages](#advantages) 65 | - [Disadvantages](#disadvantages-1) 66 | - [Ch8. The templated method pattern - encapsulating algorithm](#ch8-the-templated-method-pattern---encapsulating-algorithm) 67 | - [Use cases](#use-cases) 68 | - [Intentions](#intentions-1) 69 | - [Terms](#terms) 70 | - [Hooks](#hooks) 71 | - [The Hollywood principle](#the-hollywood-principle) 72 | - [Advantages](#advantages-1) 73 | - [Disadvantages](#disadvantages-2) 74 | - [Ch9. Model-View-Controller - Compound patterns](#ch9-model-view-controller---compound-patterns) 75 | - [The Model-View-Controller pattern](#the-model-view-controller-pattern) 76 | - [Terms](#terms-1) 77 | - [Intention](#intention) 78 | - [The MVC pattern in the real world](#the-mvc-pattern-in-the-real-world) 79 | - [Benefits of the MVC pattern](#benefits-of-the-mvc-pattern) 80 | - [Ch10. The state design pattern](#ch10-the-state-design-pattern) 81 | - [Understanding the state design pattern](#understanding-the-state-design-pattern) 82 | - [Advantages](#advantages-2) 83 | - [Disadvantages](#disadvantages-3) 84 | - [Ch11. AntiPatterns](#ch11-antipatterns) 85 | - [Software development AntiPatterns](#software-development-antipatterns) 86 | - [Spaghetti code](#spaghetti-code) 87 | - [Golden Hammer](#golden-hammer) 88 | - [Lava Flow](#lava-flow) 89 | - [Copy-and-paste or cut-and-paste programming](#copy-and-paste-or-cut-and-paste-programming) 90 | - [Software architecture AntiPatterns](#software-architecture-antipatterns) 91 | - [Reinventing the wheel](#reinventing-the-wheel) 92 | - [Vendor lock-in](#vendor-lock-in) 93 | - [Design by committee](#design-by-committee) 94 | 95 | # Ch1. Introduction to design patterns 96 | 97 | ## Understanding object-oriented programming 98 | - Concept of *objects* that have attributes (data members) and procedures (member functions) 99 | - Procedures are responsible for manipulating the attributes 100 | - Objects, which are instances of classes, interact among each other to serve the purpose of an application under development 101 | 102 | ### Classes 103 | - Define objects in attributes and behaviors (methods) 104 | - Classes consist of constructors that provide the initial state for these objects 105 | - Are like templates and hence can be easily reused 106 | 107 | ### Methods 108 | - Represent the behavior of the object 109 | - Work on attributes and also implement the desired functionality 110 | 111 | ## Major aspects of OOP 112 | 113 | ### Encapsulation 114 | - An object's behavior is kept hidden from the outside world or objects keep their state information private 115 | - Clients can't change the object's internal state by directly acting on them 116 | - Clients request the object by sending requests. Based on the type, objects may respond by changing their internal state using special member functions such as `get` and `set` 117 | 118 | ### Polymorphism 119 | - Can be of two types: 120 | - An object provides different implementations of the method based on input parameters 121 | - The same interface can be used by objects of different types 122 | - In Python polymorphism is a feature built-in for the language (e.g. 
+ operator)
123 | 
124 | ### Inheritance
125 | - Indicates that one class derives (most of) its functionality from the parent class
126 | - An option to reuse functionality defined in the base class and allow independent extensions of the original software implementation
127 | - Creates hierarchy via the relationships among objects of different classes
128 | - Python supports multiple inheritance (multiple base classes)
129 | 
130 | ### Abstraction
131 | - Provides a simple interface to the clients. Clients can interact with class objects and call methods defined in the interface
132 | - Abstracts the complexity of internal classes with an interface so that the client need not be aware of internal implementations
133 | 
134 | ### Composition
135 | - Combine objects or classes into more complex data structures or software implementations
136 | - An object is used to call member functions in other modules thereby making base functionality available across modules without inheritance
137 | 
138 | ## Object-oriented design principles
139 | 
140 | ### The open/close principle
141 | > **Classes or objects and methods should be open for extension but closed for modifications**
142 | 
143 | - Make sure you write your classes or modules in a generic way
144 | - Existing classes are not changed, reducing the chances of regression
145 | - Helps maintain backward compatibility
146 | 
147 | ### The inversion of control principle
148 | > **High-level modules shouldn't be dependent on low-level modules; they should be dependent on abstractions. Details should depend on abstractions and not the other way round**
149 | 
150 | - The base module and dependent module should be decoupled with an abstraction layer in between
151 | - The details of your class should represent the abstractions
152 | - Tight coupling between modules is no longer prevalent, so there is less complexity/rigidity in the system
153 | - Easy to deal with dependencies across modules in a better way
154 | 
155 | ### The interface segregation principle
156 | > **Clients should not be forced to depend on interfaces they don't use**
157 | 
158 | - Forces developers to write thin interfaces and have methods that are specific to the interface
159 | - Helps you not to populate interfaces by adding unintentional methods
160 | 
161 | ### The single responsibility principle
162 | > **A class should have only one reason to change**
163 | 
164 | - If a class is taking care of two functionalities, it is better to split them
165 | - Functionality = a reason to change
166 | - Whenever there is a change in one functionality, this particular class needs to change, and nothing else
167 | - If a class has multiple functionalities, the dependent classes will have to undergo changes for multiple reasons, which gets avoided
168 | 
169 | ### The substitution principle
170 | > **Derived classes must be able to completely substitute the base classes**
171 | 
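A minimal sketch of the open/close and substitution principles in Python; the shape classes are illustrative, not from the book. New behavior arrives via a new subclass (no existing class is modified), and any subclass can stand in for the base class:

```
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self) -> float:
        ...


class Rectangle(Shape):
    def __init__(self, width: float, height: float):
        self.width, self.height = width, height

    def area(self) -> float:
        return self.width * self.height


class Circle(Shape):  # extension: added without touching existing classes
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return 3.14159 * self.radius ** 2


def total_area(shapes):
    # works with any Shape subclass (substitution principle)
    return sum(shape.area() for shape in shapes)
```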
172 | ## The concept of design patterns
173 | - Solutions to given problems
174 | - Design patterns are discoveries and not an invention in themselves
175 | - Is about learning from others' successes rather than your own failures!
176 | 
177 | ### Advantages of design patterns
178 | - Reusable across multiple projects
179 | - Architectural-level problems can be solved
180 | - Time-tested and well-proven, which is the experience of developers and architects
181 | - They are reliable and dependable
182 | 
183 | ### Patterns for dynamic languages
184 | Python:
185 | - Types or classes are objects at runtime
186 | - Variables can have type as a value and can be modified at runtime
187 | - Dynamic languages have more flexibility in terms of class restrictions
188 | - Everything is public by default
189 | - Design patterns can be easily implemented in dynamic languages
190 | 
191 | ## Classifying patterns
192 | - Creational
193 | - Structural
194 | - Behavioral
195 | 
196 | > Classification of patterns is done based primarily on how the objects get created, how classes and objects are structured in a software application, and also covers the way objects interact among themselves
197 | 
198 | ### Creational patterns
199 | - Work on the basis of how objects can be created
200 | - Isolate the details of object creation
201 | - Code is independent of the type of object to be created
202 | 
203 | ### Structural patterns
204 | - Design the structure of objects and classes so that they can compose to achieve larger results
205 | - Focus on simplifying the structure and identifying the relationship between classes and objects
206 | - Focus on class inheritance and composition
207 | 
208 | ### Behavioral patterns
209 | - Concerned with the interaction among objects and responsibilities of objects
210 | - Objects should be able to interact and still be loosely coupled
211 | 
212 | # Ch2. The singleton design pattern
213 | - Typically used in logging or database operations, printer spoolers, thread pools, caches, dialog boxes, registry settings, and so on
214 | - Ensure that only one object of the class gets created
215 | - Provide an access point for an object that is global to the program
216 | - Control concurrent access to resources that are shared
217 | - Make the constructor private and create a static method that does the object initialization
218 | - Override the `__new__` method (Python's special method to instantiate objects) to control the object creation (see the sketch at the end of this chapter)
219 | - Another use case: **lazy instantiation**. Makes sure that the object gets created when it's actually needed
220 | - All modules are Singletons by default because of Python's importing behavior
221 | 
222 | ## Monostate Singleton pattern
223 | - All objects share the same state
224 | - Assign the `__dict__` variable with the `__shared_state` class variable. Python uses `__dict__` to store the state of every object of a class
225 | 
226 | ## Singletons and metaclasses
227 | - A metaclass is a class of a class
228 | - The class is an instance of its metaclass
229 | - Programmers get an opportunity to create classes of their own type from the predefined Python classes
230 | 
231 | ## Drawbacks
232 | - Singletons have a global point of access
233 | - All classes that are dependent on global variables get tightly coupled, as a change to the global data by one class can inadvertently impact the other class
234 | 
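A minimal sketch of the `__new__`-based Singleton described above; the class name is illustrative, not from the book:

```
class Singleton:
    _instance = None  # cached single instance

    def __new__(cls, *args, **kwargs):
        # Create the object only once; later calls reuse the cached instance
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


s1 = Singleton()
s2 = Singleton()
assert s1 is s2  # both names point to the same object
```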
235 | # Ch3. The factory pattern - building factories to create objects
236 | 
237 | ## Understanding the factory pattern
238 | - Factory = a class that is responsible for creating objects of other types
239 | - The class that acts as a factory has an object and methods associated with it
240 | - The client calls this method with certain parameters; objects of desired types are created in turn and returned to the client by the factory
241 | 
242 | **Advantages**
243 | - Loose coupling: object creation can be independent of the class implementation
244 | - The client only needs to know the interface, methods, and parameters that need to be passed to create objects of the desired type (simplifies implementations for the client)
245 | - Adding another class to the factory to create objects of another type can be easily done without the client changing the code
246 | 
247 | ## The simple factory pattern
248 | - Not a pattern in itself
249 | - Helps create objects of different types rather than direct object instantiation
250 | 
251 | ## The factory method pattern
252 | - We define an interface to create objects, but instead of the factory being responsible for the object creation, the responsibility is deferred to the subclass that decides the class to be instantiated (see the sketch at the end of this chapter)
253 | - Creation is through inheritance and not through instantiation
254 | - Makes the design more customizable. It can return the same instance or subclass rather than an object of a certain type
255 | 
256 | > The factory method pattern defines an interface to create an object, but defers the decision on which class to instantiate to its subclasses
257 | 
258 | **Advantages**
259 | - Makes the code generic and flexible, not being tied to a certain class for instantiation. We're dependent on the interface (Product) and not on the ConcreteProduct class
260 | - Loose coupling: the code that creates the object is separate from the code that uses it
261 | - The client doesn't need to bother about what argument to pass and which class to instantiate -> the addition of new classes is easy and involves low maintenance
262 | 
263 | ## The abstract factory pattern
264 | 
265 | > Provide an interface to create families of related objects without specifying the concrete class
266 | 
267 | - Makes sure that the client is isolated from the creation of objects but allowed to use the objects created
268 | 
269 | ## Factory method versus abstract factory method
270 | 
271 | | **Factory method** | **Abstract Factory method** |
272 | | :---: | :---: |
273 | | Exposes a method to the client to create the objects | Contains one or more factory methods of another class |
274 | | Uses inheritance and subclass to decide which object to create | Uses composition to delegate responsibility to create objects of another class |
275 | | Is used to create one product | Is about creating families of related products |
276 | 
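A minimal factory method sketch following the generic Product/ConcreteProduct terminology used above; the concrete classes are illustrative, not from the book:

```
from abc import ABC, abstractmethod


class Product(ABC):
    @abstractmethod
    def describe(self):
        ...


class ConcreteProductA(Product):
    def describe(self):
        return "product A"


class ConcreteProductB(Product):
    def describe(self):
        return "product B"


class Creator(ABC):
    @abstractmethod
    def factory_method(self):
        """Subclasses decide which Product class to instantiate."""

    def operation(self):
        # The creator depends only on the Product interface
        return f"working with {self.factory_method().describe()}"


class ConcreteCreatorA(Creator):
    def factory_method(self):
        return ConcreteProductA()


print(ConcreteCreatorA().operation())  # -> working with product A
```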
277 | # Ch4. The façade pattern - being adaptive with façade
278 | 
279 | ## Understanding Structural design patterns
280 | - Describe how objects and classes can be combined to form larger structures. Structural patterns are a combination of class and object patterns
281 | - Ease the design by identifying simpler ways to realize or demonstrate relationships between entities
282 | - Class patterns: describe abstraction with the help of inheritance and provide a more useful program interface
283 | - Object patterns: describe how objects can be associated and composed to form larger objects
284 | 
285 | ## Understanding the Façade design pattern
286 | > Façade hides the complexities of the internal system and provides an interface to the client that can access the system in a very simplified way
287 | 
288 | - Provides a unified interface to a set of interfaces in a subsystem and defines a high-level interface that helps the client use the subsystem in an easy way
289 | - Discusses representing a complex subsystem with a single interface object -> it doesn't encapsulate the subsystem, but actually combines the underlying subsystems
290 | - Promotes the decoupling of the implementation from multiple clients
291 | 
292 | ## Main participants
293 | - **Façade**: wraps up a complex group of subsystems so that it can provide a pleasing look to the outside world
294 | - **System**: represents a set of varied subsystems that make the whole system compound and difficult to view or work with
295 | - **Client**: interacts with the façade so that it can easily communicate with the subsystem and get the work completed (doesn't have to bother about the complex nature of the system)
296 | 
297 | ## The principle of least knowledge
298 | - Design principle behind the Façade pattern
299 | - Reduce the interactions between objects to just a few friends that are close enough to you
300 | 
301 | ## The Law of Demeter
302 | - Design guideline:
303 |   - Each unit should have only limited knowledge of other units of the system
304 |   - A unit should talk to its friends only
305 |   - A unit should not know about the internal details of the object that it manipulates
306 | 
307 | > The principle of least knowledge and Law of Demeter are the same and both point to the philosophy of *loose coupling*
308 | 
309 | # Ch5. The proxy pattern - controlling object access
310 | > Proxy: a system that intermediates between the seeker and provider.
Seeker is the one that makes the request, and provider delivers the resources in response to the request 311 | 312 | - A proxy server encapsulates requests, enables privacy, and works well in distributed architectures 313 | - Proxy is a wrapper or agent object that wraps the real serving object 314 | - Provide a surrogate or placeholder for another object in order to control access to a real object 315 | - Some useful scenarios: 316 | - Represents a complex system in a simpler way 317 | - Acts as a shield against malicious intentions and protect the real object 318 | - Provides a local interface for remote objects on different servers 319 | - Provides a light handle for a higher memory-consuming object 320 | 321 | ## Data Structure components 322 | - **Proxy** 323 | - **Subject/RealSubject** 324 | - **Client** 325 | 326 | ## Different types of proxies 327 | - **Virtual proxy**: placeholder for objects that are very heavy to instantiate 328 | - **Remote proxy**: provides a local representation of a real object that resides on a remote server or different address space 329 | - **Protective proxy**: controls access to the sensitive matter object of `RealSubject` 330 | - **Smart proxy**: interposes additional actions when an object is accessed 331 | 332 | | Proxy | Façade | 333 | | --------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- | 334 | | Provides you with a surrogate or placeholder for another object to control access to it | Provides you with an interface to large subsystems of classes | 335 | | A Proxy object has the same interface as that of the target object and holds references to target objects | Minimizes the communication and dependencies between subsystems | 336 | | Acts as an intermediary between the client and object that is wrapped | Provides a single simplified interface | 337 | 338 | ### Decorator vs Proxy 339 | - Decorator adds behavior to the object that it decorates at runtime 340 | - Proxy controls access to an object 341 | 342 | ### Disadvantages 343 | - Proxy pattern can increase the response time 344 | 345 | # Ch6. 
The observer pattern - keeping objects in the know
346 | 
347 | ## Behavioral patterns
348 | - Focus on the responsibilities that an object has
349 | - Deal with the interaction among objects to achieve larger functionality
350 | - Objects should be able to interact with each other, **but they should still be loosely coupled**
351 | 
352 | ## Understanding the observer design pattern
353 | > An object (Subject) maintains a list of dependents (Observers) so that the Subject can notify all the Observers about the changes that it undergoes using any of the methods defined by the Observer
354 | 
355 | - Defines a one-to-many dependency between objects so that any change in one object will be notified to the other dependent objects automatically (see the sketch at the end of this chapter)
356 | - Encapsulates the core component of the Subject
357 | 
358 | ## The pull model
359 | - Subject broadcasts to all the registered Observers when there is any change
360 | - Observer is responsible for getting the changes or pulling data from the Subject when there is an amendment
361 | - Pull model is **ineffective**: involves two steps:
362 |   - Subject notifies the Observer
363 |   - Observer pulls the required data from the Subject
364 | 
365 | ## The push model
366 | - Changes are pushed by the Subject to the Observer
367 | - Subject can send detailed information to the Observer (even though it may not be needed) -> can result in sluggish response times when a large amount of data is sent by the Subject but is never actually used by the Observer
368 | 
369 | ## Loose coupling and the observer pattern
370 | - Coupling refers to the degree of knowledge that one object has about the other object that it interacts with
371 | 
372 | > Loosely-coupled designs allow us to build flexible object-oriented systems that can handle changes because they reduce the dependency between multiple objects
373 | 
374 | - Reduces the risk that a change made within one element might create an unanticipated impact on the other elements
375 | - Simplifies testing, maintenance, and troubleshooting problems
376 | - System can be easily broken down into definable elements
377 | 
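A minimal push-model observer sketch based on the notes above; the class names are illustrative, not from the book:

```
class Subject:
    def __init__(self):
        self._observers = []  # the dependents to notify

    def attach(self, observer):
        self._observers.append(observer)

    def notify(self, message):
        # Push model: the Subject pushes the change data to every Observer
        for observer in self._observers:
            observer.update(message)


class PrintObserver:
    def update(self, message):
        print(f"received: {message}")


subject = Subject()
subject.attach(PrintObserver())
subject.notify("state changed")  # -> received: state changed
```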
378 | # Ch7. The command pattern - encapsulating invocation
379 | > Behavioral design pattern in which an object is used to encapsulate all the information needed to perform an action or trigger an event at a later time
380 | 
381 | ## Understanding the command design pattern
382 | - A `Command` object knows about the `Receiver` objects and invokes a method of the `Receiver` object
383 | - Values for parameters of the receiver method are stored in the `Command` object
384 | - The invoker knows how to execute a command
385 | - The client creates a `Command` object and sets its receiver
386 | 
387 | ### Intentions
388 | - Encapsulating a request as an object
389 | - Allowing the parametrization of clients with different requests
390 | - Allowing requests to be saved in a queue
391 | - Providing an object-oriented callback
392 | 
393 | ### Scenarios of use
394 | - Parametrizing objects depending on the action to be performed
395 | - Adding actions to a queue and executing requests at different points
396 | - Creating a structure for high-level operations that are based on smaller operations
397 | - E.g.:
398 |   - Redo or rollback operations
399 |   - Asynchronous task execution
400 | 
401 | ## Advantages
402 | - Decouples the classes that invoke the operation from the object that knows how to execute the operation
403 | - Provides a queue system
404 | - Extensions to add new commands are easy
405 | - A rollback system with the command pattern can be defined
406 | 
407 | ## Disadvantages
408 | - High number of classes and objects working together to achieve a goal
409 | - Every individual command is a `ConcreteCommand` class that increases the volume of classes for implementation and maintenance
410 | 
411 | # Ch8. The templated method pattern - encapsulating algorithm
412 | 
413 | ## Use cases
414 | - When multiple algorithms or classes implement similar or identical logic
415 | - The implementation of algorithms in subclasses helps reduce code duplication
416 | - Multiple algorithms can be defined by letting the subclasses implement the behavior through overriding
417 | 
418 | ## Intentions
419 | - Define a skeleton of an algorithm with primitive operations
420 | - Redefine certain operations of the subclass without changing the algorithm's structure
421 | - Achieve code reuse and avoid duplicate efforts
422 | - Leverage common interfaces or implementations
423 | 
424 | ## Terms
425 | - `AbstractClass`: Declares an interface to define the steps of the algorithm
426 | - `ConcreteClass`: Defines subclass-specific step definitions
427 | - `template_method()`: Defines the algorithm by calling the step methods (see the sketch at the end of this chapter)
428 | 
429 | ## Hooks
430 | - Hook: a method that is declared in the abstract class
431 | - Give a subclass the ability to *hook into* the algorithm whenever needed
432 | - Not imperative for the subclass to use hooks
433 | 
434 | > We use abstract methods when the subclass must provide the implementation, and a hook is used when it is optional for the subclass to implement it
435 | 
436 | ## The Hollywood principle
437 | - Design principle summarized by **Don't call us, we'll call you**
438 | - Relates to the template method -> it's the high-level abstract class that arranges the steps to define the algorithm
439 | 
440 | ## Advantages
441 | - No code duplication
442 | - Uses inheritance and not composition -> only a few methods need to be overridden
443 | - Flexibility lets subclasses decide how to implement steps in an algorithm
444 | 
445 | ## Disadvantages
446 | - Debugging and understanding the sequence of flow can be confusing
447 | - Documentation and strict error handling have to be done by the programmer
448 | - Maintenance can be a problem -> changes to any level can disturb the implementation
449 | 
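A minimal sketch of the terms above (`AbstractClass`, `ConcreteClass`, `template_method()`, and a hook); the report classes are illustrative, not from the book:

```
from abc import ABC, abstractmethod


class ReportGenerator(ABC):
    """AbstractClass: fixes the skeleton of the algorithm."""

    def template_method(self):
        # The high-level class arranges the steps ("Don't call us, we'll call you")
        data = self.fetch_data()
        report = self.format_report(data)
        self.publish(report)

    @abstractmethod
    def fetch_data(self):
        ...

    @abstractmethod
    def format_report(self, data):
        ...

    def publish(self, report):
        # Hook: has a default, subclasses override it only if needed
        print(report)


class CsvReport(ReportGenerator):
    """ConcreteClass: supplies the step definitions."""

    def fetch_data(self):
        return [("a", 1), ("b", 2)]

    def format_report(self, data):
        return "\n".join(f"{key},{value}" for key, value in data)


CsvReport().template_method()
```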
450 | # Ch9. Model-View-Controller - Compound patterns
451 | > "A compound pattern combines two or more patterns into a solution that solves a recurring or general problem" - GoF
452 | 
453 | A compound pattern is not a set of patterns working together; it is a general solution to a problem
454 | 
455 | ## The Model-View-Controller pattern
456 | - Model represents the data and business logic: how information is stored and queried
457 | - View is nothing but the representation: how it is presented
458 | - Controller is the one that directs the model and view to behave in a certain way: it's the glue between the two
459 | - The view and controller are dependent on the model, but not the other way around
460 | 
461 | ### Terms
462 | - **Model - knowledge of the application**: store and manipulate data (create, modify and delete). Has state and methods to change states, but is not aware of how the data would be seen by the client
463 | - **View - the appearance**: build user interfaces and data displays (it should not contain any logic of its own and just display the data it receives)
464 | - **Controller - the glue**: connects the model and view (it has methods that are used to route requests)
465 | - **User**: requests certain results based on certain actions
466 | 
467 | ### Intention
468 | - Keep the data and presentation of the data separate
469 | - Easy maintenance of the class and implementation
470 | - Flexibility to change the way in which data is stored and displayed -> both are independent and hence have the flexibility to change
471 | 
472 | ## The MVC pattern in the real world
473 | - Django or Rails
474 | - **MTV (Model, Template, View)**: the model is the database, templates are the views, and controllers are the views/routes
475 | 
476 | ## Benefits of the MVC pattern
477 | - Easy maintenance
478 | - Loose coupling
479 | - Decreased complexity
480 | - Development efforts can run independently
481 | 
482 | # Ch10. The state design pattern
483 | - Behavioral design pattern
484 | - Sometimes referred to as an **objects for states** pattern
485 | - Used to develop Finite State Machines and helps to accommodate state transition actions
486 | 
487 | ## Understanding the state design pattern
488 | - `State`: an interface that encapsulates the object's behavior. This behavior is associated with the state of the object
489 | - `ConcreteState`: a subclass that implements the `State` interface -> implements the actual behavior associated with the object's particular state
490 | - `Context`: the interface of interest to clients.
Also maintains an instance of the `ConcreteState` subclass that internally defines the implementation of the object's particular state
491 | 
492 | ## Advantages
493 | - Removes the dependency on the if/else or switch/else conditional logic
494 | - Benefits of implementing polymorphic behavior, also easier to add states to support additional behavior
495 | - Improves cohesion: state-specific behaviors are aggregated into the `ConcreteState` classes, which are placed in one location in the code
496 | - Improves the flexibility to extend the behavior of the application and overall improves code maintenance
497 | 
498 | ## Disadvantages
499 | - Class explosion: every state needs to be defined with the help of `ConcreteState` -> might end up writing many more classes with a small functionality
500 | - `Context` class needs to be updated to deal with each behavior
501 | 
502 | # Ch11. AntiPatterns
503 | Four aspects of a bad design:
504 | - **Immobile**: hard to reuse
505 | - **Rigid**: any small change may in turn result in moving too many parts of the software
506 | - **Fragile**: any change results in breaking the existing system fairly easily
507 | - **Viscous**: changes are done in the code or environment itself to avoid difficult architectural-level changes
508 | 
509 | > An AntiPattern is an outcome of a solution to a recurring problem that is ineffective and becomes counterproductive
510 | 
511 | AntiPatterns may be the result of:
512 | - A developer not knowing the software development practices
513 | - A developer not applying design patterns in the correct context
514 | 
515 | ## Software development AntiPatterns
516 | Software deviates from the original code structure due to:
517 | - The thought process of the developer evolves with development
518 | - Use cases change based on customer feedback
519 | - Data structures designed initially may undergo change with functionality or scalability considerations
520 | 
521 | > Refactoring is one of the critical parts of the software development journey, which provides developers an opportunity to revisit the data structures and think about scalability and ever-evolving customer needs
522 | 
523 | ### Spaghetti code
524 | - Minimum reuse of structures is possible
525 | - Maintenance efforts are too high
526 | - Extension and flexibility to change is reduced
527 | 
528 | ### Golden Hammer
529 | - One solution is obsessively applied to all software projects
530 | - The product is described not by its features, but by the technology used in development
531 | - In the company corridors, you hear developers talking, "That could have been better than using this"
532 | - Requirements are not completed and not in sync with user expectations
533 | 
534 | ### Lava Flow
535 | - Low code coverage for developed tests
536 | - Commented code without reasons
537 | - Obsolete interfaces, or developers try to work around existing code
538 | 
539 | ### Copy-and-paste or cut-and-paste programming
540 | - Similar types of issues across the software application
541 | - Higher maintenance costs and increased bug life cycle
542 | - Less modular code base with the same code running into a number of lines
543 | - Inheriting issues that existed in the first place
544 | 
545 | ## Software architecture AntiPatterns
546 | > Software architecture looks at modeling the software that is well understood by the development and test teams, product managers, and other stakeholders
547 | 
548 | ### Reinventing the wheel
549 | - Too many solutions to solve one standard problem, with many
of them not being well thought out 550 | - More time and resource utilization for the engineering team leading overbudgeting and more time to market 551 | - A closed system architecture (useful for only one product), duplication of efforts, and poor risk management 552 | 553 | ### Vendor lock-in 554 | - Release cycles and product maintenance cycles of a company's product releases are directly dependent on the vendor's release time frame 555 | - The product is developed around the technology rather than on the customer's requirements 556 | - The product's time to market is unreliable and doesn't meet customer's expectations 557 | 558 | ### Design by committee 559 | - Conflicting viewpoints between developers and architects even after the design is finalized 560 | - Overly complex design that is very difficult to document 561 | - Any change in the specification or design undergoes review by many, resulting in implementation delays 562 | -------------------------------------------------------------------------------- /master-large-data-python/cover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/master-large-data-python/cover.png -------------------------------------------------------------------------------- /master-large-data-python/map_reduce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/master-large-data-python/map_reduce.png -------------------------------------------------------------------------------- /master-large-data-python/notes.md: -------------------------------------------------------------------------------- 1 | # Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code 2 | 3 | Authors: John T. Wolohan 4 | 5 | [Available here](https://www.manning.com/books/mastering-large-datasets-with-python) 6 | 7 | ![cover](cover.png) 8 | 9 | - [Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code](#mastering-large-datasets-with-python-parallelize-and-distribute-your-python-code) 10 | - [Ch1. Introduction](#ch1-introduction) 11 | - [Procedural programming](#procedural-programming) 12 | - [Parallel programming](#parallel-programming) 13 | - [The map function for transforming data](#the-map-function-for-transforming-data) 14 | - [The reduce function for advanced transformations](#the-reduce-function-for-advanced-transformations) 15 | - [Distributed computing for speed and scale](#distributed-computing-for-speed-and-scale) 16 | - [Hadoop: A distributed framework for map and reduce](#hadoop-a-distributed-framework-for-map-and-reduce) 17 | - [Spark for high-powered map, reduce, and more](#spark-for-high-powered-map-reduce-and-more) 18 | - [AWS Elastic MapReduce (EMR) - Large datasets in the cloud](#aws-elastic-mapreduce-emr---large-datasets-in-the-cloud) 19 | - [Ch2. 
Accelerating large dataset work: Map and parallel computing](#ch2-accelerating-large-dataset-work-map-and-parallel-computing) 20 | - [Pattern](#pattern) 21 | - [Lazy functions for large datasets](#lazy-functions-for-large-datasets) 22 | - [Parallel processing](#parallel-processing) 23 | - [Problems](#problems) 24 | - [Inability to pickle data or functions](#inability-to-pickle-data-or-functions) 25 | - [Order-sensitive operations](#order-sensitive-operations) 26 | - [State-dependent operations](#state-dependent-operations) 27 | - [Other observations](#other-observations) 28 | - [Ch3. Function pipelines for mapping complex transformations](#ch3-function-pipelines-for-mapping-complex-transformations) 29 | - [Helper functions and function chains](#helper-functions-and-function-chains) 30 | - [Creating a pipeline](#creating-a-pipeline) 31 | - [Compose](#compose) 32 | - [Pipe](#pipe) 33 | - [Summary](#summary) 34 | - [Ch4. Processing large datasets with lazy workflows](#ch4-processing-large-datasets-with-lazy-workflows) 35 | - [Laziness](#laziness) 36 | - [Shrinking sequences with the filter function](#shrinking-sequences-with-the-filter-function) 37 | - [Combining sequences with zip](#combining-sequences-with-zip) 38 | - [Lazy file searching with iglob](#lazy-file-searching-with-iglob) 39 | - [Understanding iterators: the magic behind lazy Python](#understanding-iterators-the-magic-behind-lazy-python) 40 | - [Generators: functions for creating data](#generators-functions-for-creating-data) 41 | - [Simulations](#simulations) 42 | - [Ch5. Accumulation operations with reduce](#ch5-accumulation-operations-with-reduce) 43 | - [Three parts of reduce](#three-parts-of-reduce) 44 | - [Accumulator functions](#accumulator-functions) 45 | - [Reductions](#reductions) 46 | - [Using map and reduce together](#using-map-and-reduce-together) 47 | - [Speeding up map and reduce](#speeding-up-map-and-reduce) 48 | - [Ch6. Speeding up map and reduce with advanced parallelization](#ch6-speeding-up-map-and-reduce-with-advanced-parallelization) 49 | - [Getting the most out of parallel map](#getting-the-most-out-of-parallel-map) 50 | - [More parallel maps: `.imap` and `starmap`](#more-parallel-maps-imap-and-starmap) 51 | - [`.imap`](#imap) 52 | - [`starmap`](#starmap) 53 | - [Parallel reduce for faster reductions](#parallel-reduce-for-faster-reductions) 54 | - [Ch7. Processing truly big datasets with Hadoop and Spark](#ch7-processing-truly-big-datasets-with-hadoop-and-spark) 55 | - [Distributed computing](#distributed-computing) 56 | - [Hadoop five modules](#hadoop-five-modules) 57 | - [YARN for job scheduling](#yarn-for-job-scheduling) 58 | - [The data storage backbone of Hadoop: HDFS](#the-data-storage-backbone-of-hadoop-hdfs) 59 | - [MapReduce jobs using Python and Hadoop Streaming](#mapreduce-jobs-using-python-and-hadoop-streaming) 60 | - [Spark for interactive workflows](#spark-for-interactive-workflows) 61 | - [PySpark for mixing Python and Spark](#pyspark-for-mixing-python-and-spark) 62 | - [Ch8. Best practices for large data with Apache Streaming and mrjob](#ch8-best-practices-for-large-data-with-apache-streaming-and-mrjob) 63 | - [Unstructured data: Logs and documents](#unstructured-data-logs-and-documents) 64 | - [JSON for passing data between mapper and reducer](#json-for-passing-data-between-mapper-and-reducer) 65 | - [mrjob for pythonic Hadoop streaming](#mrjob-for-pythonic-hadoop-streaming) 66 | - [Ch9. 
PageRank with map and reduce in PySpark](#ch9-pagerank-with-map-and-reduce-in-pyspark) 67 | - [Map-like methods in PySpark](#map-like-methods-in-pyspark) 68 | - [Reduce-like methods in PySpark](#reduce-like-methods-in-pyspark) 69 | - [Convenience methods in PySpark](#convenience-methods-in-pyspark) 70 | - [Saving RDDs to text files](#saving-rdds-to-text-files) 71 | - [Ch10. Faster decision-making with machine learning and PySpark](#ch10-faster-decision-making-with-machine-learning-and-pyspark) 72 | - [Organizing the data for learning](#organizing-the-data-for-learning) 73 | - [Auxiliary classes](#auxiliary-classes) 74 | - [Evaluation](#evaluation) 75 | - [Cross-validation in PySpark](#cross-validation-in-pyspark) 76 | - [Ch11. Large datasets in the cloud with Amazon Web Services and S3](#ch11-large-datasets-in-the-cloud-with-amazon-web-services-and-s3) 77 | - [Objects for convenient heterogenous storage](#objects-for-convenient-heterogenous-storage) 78 | - [Parquet: A concise tabular data store](#parquet-a-concise-tabular-data-store) 79 | - [Ch12. MapReduce in the cloud with Amazon's Elastic MapReduce](#ch12-mapreduce-in-the-cloud-with-amazons-elastic-mapreduce) 80 | - [Convenient cloud clusters with EMR](#convenient-cloud-clusters-with-emr) 81 | - [AWS EMR](#aws-emr) 82 | - [Starting EMR clusters with mrjob](#starting-emr-clusters-with-mrjob) 83 | - [Machine learning in the cloud with Spark on EMR](#machine-learning-in-the-cloud-with-spark-on-emr) 84 | - [Running machine learning algorithms on a truly large dataset](#running-machine-learning-algorithms-on-a-truly-large-dataset) 85 | - [EC2 instance types and clusters](#ec2-instance-types-and-clusters) 86 | - [Software available on EMR](#software-available-on-emr) 87 | 88 | # Ch1. Introduction 89 | Map and reduce style of programming: 90 | - easily write parallel programs 91 | - organize the code around two functions: `map` and `reduce` 92 | 93 | > `MapReduce` = framework for parallel and distributed computing; `map` and `reduce` = style of programming that allows running the work in parallel with minimal rewriting and extend the work to distributed workflows 94 | 95 | **Dask** -> another tool for managing large data without `map` and `reduce` 96 | 97 | ## Procedural programming 98 | Program Workflow 99 | 1. Starts to run 100 | 2. issues an instruction 101 | 3. instruction is executed 102 | 4. repeat 2 and 3 103 | 5. finishes running 104 | 105 | ## Parallel programming 106 | Program workflow 107 | 1. Starts to run 108 | 2. divides up the work into chunks of instructions and data 109 | 3. each chunk of work is executed independently 110 | 4. chunks of work are reassembled 111 | 5. 
finishes running
112 | 
113 | ![map_reduce](map_reduce.png)
114 | 
115 | > The `map` and `reduce` style is applicable everywhere, but its specific strengths are in areas where you may need to scale
116 | 
117 | ## The map function for transforming data
118 | - `map`: function to transform sequences of data from one type to another
119 | - Always retains the same number of objects in the output as were provided in the input
120 | - Performs one-to-one transformations -> is a great way to transform data so it is more suitable for use
121 | 
122 | > Declarative programming: focuses on explaining the logic of the code and not on specifying low-level details -> scaling is natural, the logic stays the same
123 | 
124 | ## The reduce function for advanced transformations
125 | - `reduce`: transform a sequence of data into a data structure of any shape or size
126 | - MapReduce programming pattern relies on the `map` function to transform some data into another type of data and then uses the `reduce` function to combine that data
127 | - Performs one-to-any transformations -> is a great way to assemble data into a final result
128 | 
129 | ## Distributed computing for speed and scale
130 | Extension of parallel computing in which the computer resource we are dedicating to work on each chunk of a given task is its own machine
131 | 
132 | ## Hadoop: A distributed framework for map and reduce
133 | - Designed as an open source implementation of Google's original MapReduce framework
134 | - Evolved into distributed computing software used widely by companies processing large amounts of data
135 | 
136 | ## Spark for high-powered map, reduce, and more
137 | - Something of a successor to the Apache Hadoop framework that does more of its work in memory instead of by writing to file
138 | - Can run more than 100x faster than Hadoop
139 | 
140 | ## AWS Elastic MapReduce (EMR) - Large datasets in the cloud
141 | - Popular way to implement Hadoop and Spark
142 | - Tackle small problems with parallel programming, as it's cost-effective
143 | - Tackle large problems with parallel programming, because we can procure as many resources as we need
144 | 
145 | # Ch2.
Accelerating large dataset work: Map and parallel computing
146 | `map`'s primary capabilities:
147 | - Replace `for` loops
148 | - Transform data
149 | - `map` evaluates only when necessary, not when called -> generic `map` object as output
150 | 
151 | `map` makes it easy to parallelize code -> break the work into pieces
152 | 
153 | ## Pattern
154 | - Take a sequence of data
155 | - Transform it with a function
156 | - Get the outputs
157 | 
158 | > Using `generators` instead of normal loops prevents storing all objects in memory in advance
159 | 
160 | ## Lazy functions for large datasets
161 | - `map` = lazy function = it doesn't evaluate when we call `map`
162 | - Python stores the instructions for evaluating the function and runs them at the exact moment we ask for the value
163 | - Common lazy objects in Python = `range` function
164 | - Lazy `map` allows us to transform a lot of data without an unnecessarily large amount of memory or spending the time to generate it
165 | 
166 | ## Parallel processing
167 | ### Problems
168 | #### Inability to pickle data or functions
169 | - *Pickling*: Python's version of object serialization or marshalling
170 | - Storing objects from our code in an efficient binary format on the disk that can be read back by our program at a later time (`pickle` module)
171 | - Allows us to share data across processors or even machines, saving the instructions and data and then executing them elsewhere
172 | - Objects we can't pickle: lambda functions, nested functions, nested classes
173 | - The `pathos` and `dill` modules allow us to pickle almost anything
174 | 
175 | #### Order-sensitive operations
176 | - Work in parallel: not guaranteed that tasks will be finished in the same order they're input
177 | - If work needs to be processed in a linear order -> probably shouldn't do it in parallel
178 | - Even though Python may not complete the problems in order, it still remembers the order in which it was supposed to do them -> `map` returns in the exact order we would expect, even if it doesn't process in that order
179 | 
180 | #### State-dependent operations
181 | - Common solution for the state problem: **take the internal state and make it an external variable**
182 | 
183 | ## Other observations
184 | - Best way to flatten a list into one big list -> Python's itertools `chain` function: takes an iterable of iterables and chains them together so they can all be accessed one after another -> lazy by default
185 | - Best way to visualize graphs is to take it out of Python and import it into Gephi: dedicated piece of graph visualization software
186 | 
187 | > Anytime we're converting a sequence of some type into a sequence of another type, what we're doing can be expressed as a map -> N-to-N transformation: we're converting N data elements into N data elements, but in a different format
188 | 
189 | - Making this type of problem parallel only adds a few lines of code (see the sketch below):
190 |   - one import
191 |   - wrangling our processors with `Pool()`
192 |   - modifying our `map` statements to use the `Pool.map` method
193 | 
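A minimal sketch of that change using the standard library's `multiprocessing`; the squaring function is illustrative, not from the book:

```
from multiprocessing import Pool


def square(x):
    return x * x


if __name__ == "__main__":
    numbers = range(10)
    lazy_result = map(square, numbers)        # lazy, single-core map
    with Pool() as pool:                      # wrangle the processors
        parallel_result = pool.map(square, numbers)  # same call, now parallel
    print(list(lazy_result), parallel_result)
```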
194 | # Ch3. Function pipelines for mapping complex transformations
195 | 
196 | ## Helper functions and function chains
197 | **Helper functions**: small, simple functions that we rely on to do complex things -> break down large problems into small pieces that we can code quickly
198 | 
199 | **Function chains** or **pipelines**: the way we put helper functions to work
200 | 
201 | ### Creating a pipeline
202 | - Chaining helper functions together
203 | - Ways to do this:
204 |   - Using a sequence of maps
205 |   - Chaining functions together with `compose`
206 |   - Creating a function pipeline with `pipe`
207 | - `compose` and `pipe` are functions in the `toolz` package
208 | 
209 | #### Compose
210 | ```
211 | from toolz.functoolz import compose
212 | ```
213 | 
214 | - Pass `compose` all the functions we want to include in our pipeline
215 | - Pass them in **reverse order** because `compose` is going to apply them backwards
216 | - Store the output of our `compose` function, which is itself a function, to a variable
217 | - Call that variable or pass it along to `map`
218 | 
219 | #### Pipe
220 | ```
221 | from toolz.functoolz import pipe
222 | ```
223 | 
224 | - The `pipe` function will pass a value through a pipeline
225 | - `pipe` expects the functions to be in the order we want to apply them
226 | - `pipe` evaluates each of the functions and returns a result
227 | - If we want to pass it to `map`, we have to wrap it in a function definition
228 | 
229 | ## Summary
230 | > Major advantages of creating pipelines of helper functions are that the code becomes: **Readable and clear; Modular and easy to edit**
231 | 
232 | - Modular code plays very nicely with `map` and can readily move into parallel workflows, such as by using the `Pool()`
233 | - We can simplify working with nested data structures by using nested function pipelines, which we can apply with `map`
234 | 
235 | # Ch4. Processing large datasets with lazy workflows
236 | ## Laziness
237 | - *Lazy evaluation*: a strategy for deciding when to perform computations
238 | - Under lazy evaluation, the Python interpreter executes lazy Python code only when the program needs the results of that code
239 | - Opposite of *eager evaluation*, where everything is evaluated when it's called
240 | 
241 | ### Shrinking sequences with the filter function
242 | - `filter`: function for pruning sequences.
243 | - Takes a sequence and restricts it to only the elements that meet a given condition
244 | - Related functions to know:
245 |   - `itertools.filterfalse`: get all the results that make a qualifier function return `False`
246 |   - `toolz.dicttoolz.keyfilter`: filter on the keys of a `dict`
247 |   - `toolz.dicttoolz.valfilter`: filter on the values of a `dict`
248 |   - `toolz.dicttoolz.itemfilter`: filter on both the keys and the values of a `dict`
249 | 
250 | ### Combining sequences with zip
251 | - `zip`: function for merging sequences.
252 | - Takes two sequences and returns a single sequence of `tuples`, each of which contains an element from each of the original sequences
253 | - Behaves like a zipper: it interlocks the values of Python iterables
254 | 
255 | ### Lazy file searching with iglob
256 | - `iglob`: function for lazily reading from the filesystem.
257 | - Lazy way of querying our filesystem 258 | - Find a sequence of files on our filesystem that match a given pattern 259 | 260 | ``` 261 | from glob import iglob 262 | posts = iglob("path/to/posts/2020/06/*.md") 263 | ``` 264 | 265 | ## Understanding iterators: the magic behind lazy Python 266 | - Replace data with instructions about where to find data and replace transformations with instructions for how to execute those transformations. 267 | - The computer only has to concern itself with the data it is processing right now, as opposed to the data it just processed or has to process in the future 268 | - Iterators are the base class of all the Python data types that can be iterated over 269 | 270 | > The iteration process is defined by a special method called `.__iter__()`. If a class has this method and returns an object with a `.__next__()` method, then we can iterate over it. 271 | 272 | - One-way streets: once we call `next`, the item returned is removed from the sequence. We can never back up or retrieve that item again 273 | - Not meant for by-hand inspection -> meant for processing big data 274 | 275 | ## Generators: functions for creating data 276 | - Class of functions in Python that lazily produce values in a sequence 277 | - We can create generators with functions using `yield` statements or through concise and powerful list comprehension-like generator expressions 278 | - They're a simple way of implementing an iterator 279 | - Primary advantage of generators and lazy functions: **avoiding storing more in memory than we need to** 280 | - `itertools.islice`: take chunks from a sequence 281 | 282 | > Lazy functions are great at processing data, but hardware still limits how quickly we can work through it 283 | 284 | - `toolz.frequencies`: takes a sequence in and returns a `dict` of items that occurred in the sequence as keys with corresponding values equal to the number of times they occurred -> provides the frequencies of items in our sequence 285 | 286 | ## Simulations 287 | - For simulations -> writing classes allow us to consolidate the data about each piece of the simulation 288 | - `itertools.count()`: returns a generator that produces an infinite sequence of increasing numbers 289 | - Unzipping = the opposite of zipping -> takes a single sequence and returns two -> unzip = `zip(*my_sequence)` 290 | 291 | > `operator.methodcaller`: takes a string and returns a function that calls that method with the name of that string on any object passed to it -> call class methods using functions is helpful = allows us to use functions like `map` and `filter` on them 292 | 293 | # Ch5. 
293 | # Ch5. Accumulation operations with reduce
294 | - `reduce`: function for N-to-X transformations
295 | - We have a sequence and want to transform it into something that we can't use `map` for
296 | - `map` can take care of the transformations in a very concise manner, whereas `reduce` can take care of the very final transformation
297 | 
298 | ## Three parts of reduce
299 | - **Accumulator function**
300 | - **Sequence**: object that we can iterate through, such as lists, strings, and generators
301 | - **Initializer**: initial value to be passed to our accumulator (may be *optional*) -> use an initializer not when we want to change the value of our data, but when we want to change the *type* of the data
302 | 
303 | ```
304 | from functools import reduce
305 | 
306 | reduce(acc_fn, sequence, initializer)
307 | ```
308 | 
309 | ### Accumulator functions
310 | - Does the heavy lifting for `reduce`
311 | - Special type of helper function
312 | - Common prototype:
313 | - take an accumulated value and the next element in the sequence
314 | - return another object, typically of the same type as the accumulated value
315 | - **accumulator functions always need to return a value**
316 | - Accumulator functions take two variables: one for the accumulated data (often designated as acc, left, or a), and one for the next element in the sequence (designated nxt, right, or b).
317 | 
318 | ```
319 | def my_add(acc, nxt):
320 |     return acc + nxt
321 | 
322 | # or, using lambda functions
323 | lambda acc, nxt: acc + nxt
324 | ```
325 | 
326 | ## Reductions
327 | - `filter`
328 | - `frequencies`
329 | 
330 | ## Using map and reduce together
331 | > If you can decompose a problem into an N-to-X transformation, all that stands between you and a reduction that solves that problem is a well-crafted accumulation function
332 | 
333 | - Using the `map` and `reduce` pattern to decouple the transformation logic from the actual transformation itself:
334 | - leads to highly reusable code
335 | - with large datasets -> keeping functions simple becomes paramount -> we may have to wait a long time to discover we made a small error
336 | 
337 | ## Speeding up map and reduce
338 | > Using a parallel map can counterintuitively be slower than using a lazy map in map and reduce scenarios
339 | 
340 | - We can always use parallelization at the `reduce` level instead of at the `map` level
341 | 
342 | # Ch6. Speeding up map and reduce with advanced parallelization
343 | - Parallel `reduce`: use parallelization in the accumulation process instead of the transformation process
344 | 
345 | ## Getting the most out of parallel map
346 | Parallel `map` will be slower than lazy `map` when:
347 | - we're going to iterate through the sequence a second time later in our workflow
348 | - size of the work done in each parallel instance is small compared to the overhead that parallelization imposes -> *chunksize*: size of the different pieces into which we break our tasks for parallel processing
349 | - Python makes *chunksize* available as an option -> vary it according to the task at hand (see the sketch below)
350 | 
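A minimal sketch of passing `chunksize` to a parallel map; the work function and the numbers are made up, and the best chunk size is something to measure rather than assume:

```
from multiprocessing import Pool

def score(n):
    # a deliberately tiny unit of work, so parallel overhead dominates
    return n * n

if __name__ == "__main__":
    numbers = range(1_000_000)
    with Pool() as pool:
        # larger chunks mean fewer, bigger messages between processes;
        # too large and some workers sit idle, so it is worth experimenting
        results = pool.map(score, numbers, chunksize=10_000)
    print(results[:5])   # [0, 1, 4, 9, 16]
```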
351 | ### More parallel maps: `.imap` and `starmap`
352 | #### `.imap`
353 | - `.imap`: for lazy parallel mapping
354 | - use the `.imap` method to work in parallel on very large sequences efficiently
355 | - Lazy and parallel? use the `.imap` and `.imap_unordered` methods of `Pool()` -> both methods return iterators instead of lists
356 | - `.imap_unordered`: behaves the same, except it doesn't necessarily put the sequence in the right order for our iterator
357 | 
358 | #### `starmap`
359 | - use `starmap` to work with complex iterables, especially those we're likely to create using the `zip` function -> handles functions that need more than a single parameter (`map`'s limitation)
360 | - `starmap` unpacks `tuples` as **positional parameters** to the function with which we're mapping
361 | - `itertools.starmap`: lazy function
362 | - `Pool().starmap`: parallel function
363 | 
364 | ## Parallel reduce for faster reductions
365 | Parallel `reduce` will:
366 | - break a problem into chunks
367 | - make no guarantees about order
368 | - need to pickle data
369 | - be finicky about stateful objects
370 | - run slower than its linear counterpart on small datasets
371 | - run faster than its linear counterpart on big datasets
372 | - require an accumulator function, some data, and an initial value
373 | - perform N-to-X transformations
374 | 
375 | > Parallel reduce has six parameters: an accumulation function, a sequence, an initializer value, a map, a chunksize, and a combination function - three more than the standard reduce function
376 | 
377 | Parallel `reduce` workflow:
378 | - break our problem into pieces
379 | - do some work
380 | - combine the work
381 | - return a result
382 | 
383 | > With parallel `reduce` we trade the simplicity of always having the same combination function for the flexibility of more possible transformations
384 | 
385 | Implementing parallel `reduce`:
386 | 1. Importing the proper classes and functions
387 | 2. Rounding up some processors
388 | 3. Passing our `reduce` function the right helper functions and variables
389 | 
390 | - Python doesn't natively support parallel `reduce` -> `pathos` library
391 | - `toolz.fold` -> parallel `reduce` implementation (sketched below)
392 | 
393 | > `toolz` library: functional utility library that Python never came with. High-performance version of the library = `CyToolz`
394 | 
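A minimal sketch of those three steps, assuming `toolz`'s `fold` (which lives in `toolz.sandbox.parallel`) and a `pathos` `ProcessingPool` for the parallel map; the summing task and the chunk size are made up for illustration:

```
from toolz.sandbox.parallel import fold
from pathos.multiprocessing import ProcessingPool

def my_add(acc, nxt):
    # accumulator: combines the running total with the next element
    return acc + nxt

if __name__ == "__main__":
    numbers = range(1_000_000)
    pool = ProcessingPool()        # "rounding up some processors"
    total = fold(
        my_add,                    # accumulation function
        numbers,                   # sequence
        0,                         # initializer
        map=pool.imap,             # parallel, lazy map applied to each chunk
        chunksize=50_000,          # how the sequence is split across workers
        combine=my_add,            # how the per-chunk results are merged
    )
    print(total)                   # 499999500000
```

Here the combination function happens to be the same as the accumulator, but for type-changing reductions the two usually differ.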
395 | # Ch7. Processing truly big datasets with Hadoop and Spark
396 | - **Hadoop**: set of tools that support distributed map and reduce style of programming through Hadoop MapReduce
397 | - **Spark**: analytics toolkit designed to modernize Hadoop
398 | 
399 | ## Distributed computing
400 | - share tasks and data long-term across a network of computers
401 | - offers large benefits in speed when we can parallelize our work
402 | - challenges:
403 | - keeping track of all our data
404 | - coordinating our work
405 | 
406 | > If we distribute our work prematurely, we’ll end up losing performance by spending too much time talking between computers and processors. A lot of performance improvements at the high-performance limits of distributed computing revolve around **optimizing communication between machines**
407 | 
408 | ## Hadoop five modules
409 | 1. *MapReduce*: way of dividing work into parallelizable chunks
410 | 2. *YARN*: scheduler and resource manager
411 | 3. *HDFS*: file system for Hadoop
412 | 4. *Ozone*: Hadoop extension for object storage and semantic computing
413 | 5. *Common*: set of utilities that are shared across the previous four modules
414 | 
415 | ### YARN for job scheduling
416 | - Scheduling
417 | - Oversees all of the work that is being done
418 | - Acts as a final decision maker in terms of how resources should be allocated across the cluster
419 | - Application management (*node managers*): work at the node (single-machine) level to determine how resources should be allocated within that machine
420 | - *federation*: ties together resource managers in extremely high-demand use cases where thousands of nodes are not sufficient
421 | 
422 | ### The data storage backbone of Hadoop: HDFS
423 | Hadoop Distributed File System (HDFS) -> reliable, performant foundation for high-performance distributed computing (but with that comes complexity). Use cases:
424 | - process big datasets
425 | - be flexible in hardware choice
426 | - be protected against hardware failure
427 | 
428 | > Moving code is faster than moving data
429 | 
430 | ### MapReduce jobs using Python and Hadoop Streaming
431 | Hadoop MapReduce with Python -> Hadoop Streaming = utility for using Hadoop MapReduce with programming languages besides Java
432 | 
433 | Hadoop natively supports compressed data formats: .gz, .bz2, and .snappy
434 | 
435 | ## Spark for interactive workflows
436 | Analytics-oriented data processing framework designed to take advantage of higher-RAM compute clusters. Advantages for Python programmers:
437 | - direct Python interface - `PySpark`: allows us to interactively explore big data through a PySpark shell REPL
438 | - can query SQL databases directly (Java Database Connectivity - JDBC)
439 | - has a *DataFrame* API: rows-and-columns data structure familiar to `pandas` -> provides a convenience layer on top of the core Spark data object: the RDD (Resilient Distributed Dataset)
440 | - Spark has two high-performance data structures: RDDs, which are excellent for any type of data, and DataFrames, which are optimized for tabular data.
441 | 
442 | Favor Spark over Hadoop when:
443 | - processing streaming data
444 | - need to get the task completed nearly instantaneously
445 | - willing to pay for high-RAM compute clusters
446 | 
447 | ### PySpark for mixing Python and Spark
448 | PySpark: we can call Spark's Scala methods through Python just like we would a normal Python library
449 | 
450 | # Ch8. Best practices for large data with Apache Streaming and mrjob
451 | Use Hadoop to process:
452 | - lots of data fast: distributed parallelization
453 | - data that's important: low data loss
454 | - enormous amounts of data: petabyte scale
455 | 
456 | Drawbacks:
457 | - To use Hadoop with Python -> must go through the Hadoop Streaming utility
458 | - Repeatedly reading strings in from `stdin`
459 | - Error messages come from Java and are not helpful
460 | 
461 | ## Unstructured data: Logs and documents
462 | - Hadoop creators designed Hadoop to work on *unstructured data* -> data in the form of documents
463 | - Unstructured data is notoriously unwieldy =/= tabular data
464 | - But it is one of the most common forms of data around
465 | 
466 | ## JSON for passing data between mapper and reducer
467 | - JavaScript Object Notation (JSON)
468 | - Data format used for moving data in plain text between one place and another
469 | - Use the `json.dumps()` and `json.loads()` functions from Python's `json` library to achieve the transfer
470 | - Advantages:
471 | - easy for humans and machines to read
472 | - provides a number of useful basic data types (string, numeric, array)
473 | - emphasis on key-value pairs that aids the loose coupling of systems
474 | 
475 | ## mrjob for pythonic Hadoop streaming
476 | - `mrjob`: Python library for Hadoop Streaming that focuses on cloud compatibility for truly scalable analysis
477 | - keeps the mapper and reducer steps but wraps them up in a single worker class, `MRJob`
478 | - `mrjob` versions of `map` and `reduce` share the same type signature, taking in keys and values and outputting keys and values
479 | - `mrjob` enforces JSON data exchange between the mapper and reducer phases, so we need to ensure that our output data is JSON serializable.
480 | 
481 | # Ch9. PageRank with map and reduce in PySpark
482 | PySpark's RDD class methods:
483 | - `map`-like methods: replicate the function of `map`
484 | - `reduce`-like methods: replicate the function of `reduce`
485 | - *Convenience methods*: solve common problems
486 | 
487 | > **Partitions** are the abstraction that RDDs use to implement parallelization. The data in an RDD is split up across different partitions, and each partition is handled in memory. It is common in large data tasks to partition an RDD by a key
488 | 
489 | ### Map-like methods in PySpark
490 | - `.map`
491 | - `.flatMap`
492 | - `.mapValues`
493 | - `.flatMapValues`
494 | - `.mapPartitions`
495 | - `.mapPartitionsWithIndex`
496 | 
497 | ### Reduce-like methods in PySpark
498 | - `.reduce`
499 | - `.fold`
500 | - `.aggregate` -> provides all the functionality of a parallel reduce. We can provide an initializer value, an aggregation function, and a combination function
501 | 
502 | ### Convenience methods in PySpark
503 | Many of these mirror functions in `functools`, `itertools`, and `toolz`. Some examples:
504 | - .countByKey()
505 | - .countByValue()
506 | - .distinct()
507 | - .countApproxDistinct()
508 | - .filter()
509 | - .first()
510 | - .groupBy()
511 | - .groupByKey()
512 | - .saveAsTextFile()
513 | - .take()
514 | 
515 | #### Saving RDDs to text files
516 | Excellent for a few reasons (a short sketch follows this list):
517 | - The data is in a human-readable, persistent format.
518 | - We can easily read this data back into Spark with the `.textFile` method of `SparkContext`.
519 | - The data is well structured for other parallel tools, such as Hadoop’s MapReduce.
520 | - We can specify a compression format for efficient data storage or transfer.
521 | 
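A minimal sketch of a few of these RDD methods running on a local `SparkContext`; the numbers and the output path are made up:

```
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

numbers = sc.parallelize(range(1, 6), numSlices=2)   # RDD split across 2 partitions
squares = numbers.map(lambda n: n * n)               # map-like: transform every element
total = squares.reduce(lambda a, b: a + b)           # reduce-like: combine to one value
print(total)                                         # 55

# convenience method: writes one text file part per partition
squares.saveAsTextFile("path/to/squares_output")

sc.stop()
```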
522 | > The RDD `.aggregate` method returns a plain Python `dict`, not an RDD. We need an RDD so that we can take advantage of Spark’s parallelization. To get an RDD, we’ll need to explicitly convert the items of that dict into an RDD using the `.parallelize` method from our SparkContext: `sc`.
523 | 
524 | - Spark programs often use \ characters in their method chaining to increase their readability
525 | - Using the `byKey` variations of methods in PySpark often results in significant speed-ups because like data is worked on by the same distributed compute worker
526 | 
527 | # Ch10. Faster decision-making with machine learning and PySpark
528 | One of the reasons why Spark is so popular = built-in machine learning capabilities
529 | 
530 | PySpark’s machine learning capabilities live in a package called `ml`. This package itself contains a few different modules categorizing some of the core machine learning capabilities, including:
531 | - `pyspark.ml.feature` — For feature transformation and creation
532 | - `pyspark.ml.classification` — Algorithms for judging the category in which a data point belongs
533 | - `pyspark.ml.tuning` — Algorithms for improving our machine learners
534 | - `pyspark.ml.evaluation` — Algorithms for evaluating machine learners
535 | - `pyspark.ml.util` — Methods of saving and loading machine learners
536 | 
537 | > PySpark’s machine learning features expect us to have our data in a PySpark `DataFrame` object - not an `RDD`. The `RDD` is an abstract parallelizable data structure at the core of Spark, whereas the `DataFrame` is a layer on top of the `RDD` that provides a notion of rows and columns
538 | 
539 | ## Organizing the data for learning
540 | Spark's ml classifiers look for two columns in a `DataFrame`:
541 | - A `label` column: indicates the correct classification of the data
542 | - A `features` column: contains the features we're going to use to predict that label
543 | 
544 | ## Auxiliary classes
545 | - PySpark's `StringIndexer`: transforms categorical data stored as category names (using strings) and indexes the names as numerical variables. `StringIndexer` indexes categories in order of frequency — from most common to least common. The most common category will be 0, the second most common category 1, and so on
546 | - Most data structures in Spark are immutable -> property of Scala (in which Spark is written)
547 | - Spark's ml classifiers want a single column named `features` -> PySpark's `VectorAssembler`: a `Transformer` like `StringIndexer` -> takes some input column names and an output column name and has methods to return a new `DataFrame` that has all the columns of the original, plus the new column we want to add
548 | - The feature creation classes are `Transformer`-class objects, and their methods return new `DataFrames`, rather than transforming them in place (see the sketch at the end of this chapter)
549 | 
550 | ## Evaluation
551 | PySpark's `ml.evaluation` module:
552 | - `BinaryClassificationEvaluator`
553 | - `RegressionEvaluator`
554 | - `MulticlassClassificationEvaluator`
555 | 
556 | ## Cross-validation in PySpark
557 | `CrossValidator` class: k-fold cross-validation, needs to be initialized with:
558 | - An *estimator*
559 | - A *parameter grid* - a `ParamGridBuilder` object
560 | - An *evaluator*
561 | 
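A minimal sketch of these pieces working together; the tiny DataFrame, the column names, and the choice of `LogisticRegression` are made up for illustration:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# made-up data: (category, amount, label)
df = spark.createDataFrame(
    [("red", 1.0, 0.0), ("blue", 3.0, 1.0), ("red", 2.5, 1.0), ("blue", 0.5, 0.0)],
    ["color", "amount", "label"],
)

# index the string category as a number (most frequent category -> 0)
indexed = StringIndexer(inputCol="color", outputCol="color_ix").fit(df).transform(df)

# collect the predictors into the single `features` column Spark expects
assembled = VectorAssembler(
    inputCols=["color_ix", "amount"], outputCol="features"
).transform(indexed)

model = LogisticRegression().fit(assembled)     # uses `features` and `label` by default
predictions = model.transform(assembled)

evaluator = BinaryClassificationEvaluator()     # defaults to area under the ROC curve
print(evaluator.evaluate(predictions))

spark.stop()
```

Note how every step returns a new `DataFrame` rather than mutating the old one, which is the immutability property mentioned above.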
562 | # Ch11. Large datasets in the cloud with Amazon Web Services and S3
563 | S3 is the go-to service for large datasets:
564 | 1. *effectively unlimited storage capacity*. We never have to worry about our dataset becoming too large
565 | 2. *cloud-based*. We can scale up and down quickly as necessary.
566 | 3. *offers object storage*. We can focus on organizing our data with metadata and store many different types of data.
567 | 4. *managed service*. Amazon Web Services takes care of a lot of the details for us, such as ensuring data availability and durability. They also take care of security patches and software updates.
568 | 5. *supports versioning and life cycle policies*. We can use them to update or archive our data as it ages
569 | 
570 | ## Objects for convenient heterogeneous storage
571 | - Object storage: storage pattern that focuses on the **what of the data instead of the where**
572 | - With object storage we recognize objects by a unique identifier (instead of the name and directory)
573 | - Supports arbitrary metadata -> we can tag our objects flexibly based on our needs (helps us find those objects later when we need to use them)
574 | - Querying tools are available for S3 that allow SQL-like querying on these metadata tags for metadata analysis
575 | - Unique identifiers -> we can store heterogeneous data in the same way
576 | 
577 | ## Parquet: A concise tabular data store
578 | - CSV is a simple, tabular data store, and JSON is a human-readable document store. Both are common in data interchange and are often used in the storage of distributed large datasets. Parquet is a Hadoop-native tabular data format.
579 | - Parquet uses clever metadata to improve the performance of map and reduce operations. Running a job on Parquet can take as little as 1/100th the time a comparable job on a CSV or JSON file would take. Additionally, Parquet supports efficient compression. As a result, it can be stored at a fraction of the cost of CSV or JSON.
580 | - These benefits make Parquet an excellent option for data that primarily needs to be read by a machine, such as for batch analytics operations. JSON and CSV remain good options for smaller data or data that's likely to need some human inspection.
581 | 
582 | > Boto is a library that provides Pythonic access to many of the AWS APIs. We need the access key and secret key to programmatically access AWS through boto
583 | 
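A minimal sketch of talking to S3 with `boto3` (the current incarnation of Boto); the bucket name, keys, and local path are made up, and the access key and secret key are assumed to be configured in the environment:

```
import boto3

s3 = boto3.client("s3")   # picks up the access key and secret key from the environment

# upload a local Parquet file as an object identified by its key
s3.upload_file("data/pages.parquet", "my-example-bucket", "pages/2020/06/pages.parquet")

# list objects under a prefix -- we find data by identifier, not by browsing directories
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="pages/2020/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```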
584 | # Ch12. MapReduce in the cloud with Amazon's Elastic MapReduce
585 | ## Convenient cloud clusters with EMR
586 | Ways to get access to a compute cluster that supports both Hadoop and Spark:
587 | - AWS: Amazon's Elastic MapReduce
588 | - Microsoft's Azure HDInsight
589 | - Google's Cloud Dataproc
590 | 
591 | ### AWS EMR
592 | - AWS EMR is a managed data cluster service
593 | - We specify general properties of the cluster, and AWS runs software that creates the cluster for us
594 | - When we're done using the cluster, AWS absorbs the compute resources back into its network
595 | - Pricing model is a per-compute-unit per-second charge
596 | - **There are no cost savings to doing things slowly. AWS encourages us to parallelize our problems away**
597 | 
598 | #### Starting EMR clusters with mrjob
599 | - We can run Hadoop jobs on EMR with the `mrjob` library, which allows us to write distributed MapReduce jobs and procure cluster computing in Python.
600 | - We can use `mrjob`'s configuration files to describe what we want our clusters to look like, including which instances we’d like to use, where we’d like those instances to be located, and any tags we may want to add.
601 | 
602 | > Hadoop on EMR is excellent for large data processing workloads, such as batch analytics or extract-transform-load (ETL)
603 | 
604 | ## Machine learning in the cloud with Spark on EMR
605 | - Hadoop is great for low-memory workloads and massive data.
606 | - Spark is great for jobs that are harder to break down into map and reduce steps, and situations where we can afford higher-memory machines
607 | 
608 | ### Running machine learning algorithms on a truly large dataset
609 | 1. Get a sample of the full dataset.
610 | 2. Train and evaluate a few models on that dataset.
611 | 3. Select some models to evaluate on the full dataset.
612 | 4. Train several models on the full dataset in the cloud.
613 | 
614 | > Run your Spark code with the `spark-submit` utility instead of plain `python`. The `spark-submit` utility queues up a Spark job, which will run in parallel locally and simulate what would happen if you ran the program on an active cluster
615 | 
616 | ### EC2 instance types and clusters
617 | - `M-series`: use for Hadoop and for testing Spark jobs
618 | - `C-series`: compute-heavy workloads such as Spark analytics and batch Spark jobs
619 | - `R-series`: high-memory, use for streaming analytics
620 | 
621 | ### Software available on EMR
622 | - JupyterHub: cluster-ready version of Jupyter Notebook -> run interactive Spark and Hadoop jobs from a notebook environment
623 | - Hive: compile SQL code to Hadoop MapReduce jobs
624 | - Pig: compile *Pig Latin* (SQL-like) commands to run Hadoop MapReduce jobs
625 | 
626 | 
--------------------------------------------------------------------------------