├── .gitignore ├── README.md ├── architecture-patterns-python ├── abstraction.jpeg ├── arch_patterns_cover.jpg └── notes.md ├── clean-code ├── clean_code.jpeg └── notes.md ├── examples ├── fp_python.ipynb ├── map_reduce.ipynb ├── network_wikipedia.ipynb ├── parallel_benchmark.ipynb └── parallel_example.py ├── high-performance-python ├── compiler_options.png ├── cover.jpg └── notes.md ├── learning-python-design-patterns ├── cover.jpg └── notes.md └── master-large-data-python ├── cover.png ├── map_reduce.png └── notes.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | .ipynb_checkpoints 3 | .vscode -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # My notes on some books I read on Python 2 | 3 | - High performance Python: Practical Performant Programming for Humans - Micha Gorelick, Ian Ozsvald [[open notes]](./high-performance-python/notes.md) 4 | 5 | - Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code - John T. Wolohan [[open notes]](./master-large-data-python/notes.md) 6 | 7 | - Clean Code: A Handbook of Agile Software Craftsmanship - Robert C. Martin [[open notes]](./clean-code/notes.md) 8 | 9 | - Learning Python Design Patterns - Chetan Giridhar [[open notes]](./learning-python-design-patterns/notes.md) 10 | 11 | - Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices - Bob Gregory, Harry Percival [[open notes]](./architecture-patterns-python/notes.md) 12 | 13 | - Fluent Python - Luciano Ramalho 14 | 15 | ![compiler_options](high-performance-python/compiler_options.png) 16 | 17 | ![map_reduce](master-large-data-python/map_reduce.png) -------------------------------------------------------------------------------- /architecture-patterns-python/abstraction.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/architecture-patterns-python/abstraction.jpeg -------------------------------------------------------------------------------- /architecture-patterns-python/arch_patterns_cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/architecture-patterns-python/arch_patterns_cover.jpg -------------------------------------------------------------------------------- /architecture-patterns-python/notes.md: -------------------------------------------------------------------------------- 1 | # Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices 2 | 3 | Authors: Bob Gregory, Harry Percival 4 | 5 | 6 | [Available here](https://www.cosmicpython.com/) 7 | 8 | ![arch_patterns_cover](arch_patterns_cover.jpg) 9 | 10 | - [Architecture Patterns with Python: Enabling Test-Driven Development, Domain-Driven Design, and Event-Driven Microservices](#architecture-patterns-with-python-enabling-test-driven-development-domain-driven-design-and-event-driven-microservices) 11 | - [Introduction](#introduction) 12 | - [Encapsulation and Abstractions](#encapsulation-and-abstractions) 13 | - [Layering](#layering) 14 | - [Dependency Inversion Principle (DIP)](#dependency-inversion-principle-dip) 
15 | - [Part I. Building an architecture to support domain modeling](#part-i-building-an-architecture-to-support-domain-modeling) 16 | - [Ch1. Domain Modeling](#ch1-domain-modeling) 17 | - [What is a domain model?](#what-is-a-domain-model) 18 | - [Dataclasses are great for value objects](#dataclasses-are-great-for-value-objects) 19 | - [Value objects and entities](#value-objects-and-entities) 20 | - [Ch2. Repository Pattern](#ch2-repository-pattern) 21 | - [Object-relational mappers (ORMs)](#object-relational-mappers-orms) 22 | - [Introducing the repository pattern](#introducing-the-repository-pattern) 23 | - [What is the trade-off?](#what-is-the-trade-off) 24 | - [Ch3. A brief interlude: on coupling and abstractions](#ch3-a-brief-interlude-on-coupling-and-abstractions) 25 | - [Ch4. Our first use case: flask API and service layer](#ch4-our-first-use-case-flask-api-and-service-layer) 26 | - [A typical service function](#a-typical-service-function) 27 | - [Service Layer vs Domain Service](#service-layer-vs-domain-service) 28 | - [Ch5. TDD in high gear and low gear](#ch5-tdd-in-high-gear-and-low-gear) 29 | - [High and low gear](#high-and-low-gear) 30 | - [Ch6. Unit of Work pattern](#ch6-unit-of-work-pattern) 31 | - [Ch7. Aggregates and consistency boundaries](#ch7-aggregates-and-consistency-boundaries) 32 | - [Why not just run everything in a spreadsheet?](#why-not-just-run-everything-in-a-spreadsheet) 33 | - [Invariants, constraints and consistency](#invariants-constraints-and-consistency) 34 | - [What is an Aggregate?](#what-is-an-aggregate) 35 | - [Choosing an Aggregate](#choosing-an-aggregate) 36 | - [One Aggregate = One Repository](#one-aggregate--one-repository) 37 | - [Optimistic concurrency](#optimistic-concurrency) 38 | - [Part II. Event-Driven Architecture](#part-ii-event-driven-architecture) 39 | 40 | # Introduction 41 | 42 | > A big ball of mud is the natural state of software in the same way that wilderness is the natural state of your garden. It takes energy and direction to prevent the colapse 43 | 44 | ## Encapsulation and Abstractions 45 | Encapsulating behavior by using abstractions is a powerful tool for making code more: 46 | - expressive 47 | - testable 48 | - easier to maintain 49 | 50 | > **Responsibility-driven design**: uses the words *roles* and *responsibilities* rather than *tasks*. Think about code in terms of behavior, rather than in terms of data or algorithms 51 | 52 | ## Layering 53 | - When one function, module, or object uses another, we say that the one *depends on* the other. These dependencies form a kind of network or graph 54 | - **Layered architecture**: divide the code into discrete categories or roles, and introduce rules about which categories of code can call each other 55 | 56 | ## Dependency Inversion Principle (DIP) 57 | Formal definition: 58 | 1. High-level modules should not depend on low-level modules. Both should depend on abstractions 59 | 2. Abstractions should not depend on details. Instead, details should depend on abstractions 60 | 61 | > *Depends on* doesn't mean *imports* or *calls*, necessarily, but rather a more general idea that one module *knows about* or *needs* another module 62 | 63 | # Part I. Building an architecture to support domain modeling 64 | 65 | > "Behavior should come first and drive our storage requirements." 66 | 67 | # Ch1. Domain Modeling 68 | ## What is a domain model? 
69 | - Domain model = business logic layer 70 | - Domain: "The problem you're trying to solve" 71 | - Model: a map of a process or phenomenon that captures a useful property 72 | 73 | > **Domain-driven design (DDD)**: the most important thing about software is that it provides a useful model of a problem. If we get that model right, our software delivers value and makes new things possible 74 | 75 | - Make sure to express rules in the business jargon (*ubiquitous language* in DDD terminology) 76 | - "We could show this code to our nontechnical coworkers, and they would agree that this correctly describes the behavior of the system" 77 | - Avoiding a ball of mud: stick rigidly to principles of encapsulation and careful layering 78 | - `from typing import NewType`: wrap primitive types 79 | 80 | ### Dataclasses are great for value objects 81 | - Business concept that has data but no identity: choose to represent it using the *Value Object* pattern 82 | - *Value Object*: any domain object that is uniquely identified by the data it holds (we usually make them immutable) -> `from dataclasses import dataclass`; `@dataclass(frozen=True)` decorator above class definition 83 | - Dataclasses (or namedtuples) give us *value equality* 84 | 85 | ### Value objects and entities 86 | - Value object: any object that is identified only by its data and doesn't have a long-lived identity 87 | - Entity: domain object that has long-lived identity 88 | - Entities, unlike values, have *identity equality*: we can change their values, and they are still recognizably the same thing -> make this explicit in code by implementing `__eq__` (equality operator) on entities 89 | - `__hash__` is the magic method Python uses to control the behavior of objects when you add them to sets or use them as dict keys 90 | - **Python's magic methods let us use our models with idiomatic Python** 91 | 92 | # Ch2. Repository Pattern 93 | - A simplifying abstraction over data storage -> allow us to decouple our model layer from the data layer 94 | - **Layered architecture**: aim to keep the layers separate, and to have each layer depend only on the one below it 95 | - **Onion architecture**: think of our model as being on the "inside" and dependencies flowing inward to it 96 | - **Dependency inversion principle**: high-level modules (the domain) should not depend on low-level ones (the infrastructure) 97 | 98 | ## Object-relational mappers (ORMs) 99 | - Bridge the conceptual gap between the world of objects and domain modeling and the world of databases and relational algebra 100 | - Gives us *persistence ignorance*: our domain model doesn't need to know anything about how data is loaded or persisted -> keep our domain clean of direct dependencies on particular database technologies 101 | 102 | ## Introducing the repository pattern 103 | - The repository pattern is an abstraction over persistent storage 104 | - It hides the boring details of data access by pretending that all of our data is in memory 105 | 106 | > We often just rely on Python's duck typing to enable abstractions. To a Pythonista, a repository is *any* object that has `add(thing)` and `get(id)` methods 107 | 108 | ## What is the trade-off? 109 | - Write a few lines of code in our repository class each time we add a new domain object that we want to retrieve 110 | - In return we get a simple abstraction over our storage layer, which we control 111 | - Easy to fake out for unit tests 112 | 113 | # Ch3. 
A brief interlude: on coupling and abstractions
114 | - **Coupled components**: when we're unable to change component A for fear of breaking component B
115 | - Local coupling = good = high *cohesion* between the coupled elements
116 | - Global coupling = nuisance = increases the risk/cost of changing our code
117 | 
118 | 
119 | ![abstraction](abstraction.jpeg)
120 | - **Abstraction**: protects us from change by hiding away the complex details of whatever system B does
121 | - We can change the arrows on the right without changing the ones on the left
122 | 
123 | > Try to write a simple implementation and then refactor toward better design
124 | 
125 | # Ch4. Our first use case: flask API and service layer
126 | 
127 | ## A typical service function
128 | 1. Fetch some objects from the repository
129 | 2. Make some checks or assertions about the request against the current state of the world
130 | 3. Call a domain service
131 | 4. If all is well, save/update any state changed
132 | 
133 | - Flask app responsibilities (standard web stuff):
134 | - Per-request session management
135 | - Parsing information out of POST parameters
136 | - Response status codes
137 | - JSON
138 | - All the orchestration logic is in the use case/service layer and the domain logic stays in the domain
139 | 
140 | ## Service Layer vs Domain Service
141 | - **Service Layer (application service)**: handles requests from the outside world and orchestrates an operation
142 | - **Domain Service**: logic that belongs in the domain model, but doesn't sit naturally inside a stateful entity or value object
143 | 
144 | # Ch5. TDD in high gear and low gear
145 | - The service layer helps us clearly define our use cases and the workflow for each
146 | - Tests: help us change our system fearlessly
147 | - Don't write too many tests against the domain model: when you need to change the codebase you may need to update several unit tests
148 | - Testing against the service layer: tests don't interact directly with "private" methods/attributes on our model objects = easier to refactor them
149 | 
150 | > "Every line of code that we put in a test is like a **blob of glue**, holding the system in a particular shape. The more low-level tests we have, the harder it will be to change things."
151 | 
152 | - **"Listen to the code"**: when writing tests, finding that the code is hard to use or noticing a code smell is a trigger to refactor and reconsider the design
153 | - To improve the design of the code we must delete "sketch" tests that are too tightly coupled to a particular implementation
154 | 
155 | ## High and low gear
156 | - When starting a new project or a gnarly problem: write tests against the domain model = better feedback
157 | - When adding a new feature or fixing a bug (no need for extensive changes to the domain model): write tests against the services = lower coupling and higher coverage
158 | - **Shifting gears metaphor**
159 | - Mitigation: keep all domain dependencies in fixture functions
160 | 
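To tie the ideas from Chapters 2-5 together, here is a minimal, hypothetical sketch (not code from the book; `Order`, `FakeRepository`, and `add_line` are invented for illustration) of a service-layer function that depends only on a duck-typed repository, plus a test that drives it through an in-memory fake rather than through the domain model's internals:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Order:
    reference: str
    lines: List[str] = field(default_factory=list)


class AbstractRepository(Protocol):
    # To a Pythonista, any object with add() and get() is a repository (Ch2).
    def add(self, order: Order) -> None: ...
    def get(self, reference: str) -> Order: ...


def add_line(reference: str, sku: str, repo: AbstractRepository) -> None:
    """A typical service function (Ch4): fetch, check, act, save."""
    order = repo.get(reference)
    if sku in order.lines:
        raise ValueError(f"{sku} is already on order {reference}")
    order.lines.append(sku)
    repo.add(order)


class FakeRepository:
    """In-memory fake used by service-layer tests (Ch5)."""

    def __init__(self) -> None:
        self._orders: Dict[str, Order] = {}

    def add(self, order: Order) -> None:
        self._orders[order.reference] = order

    def get(self, reference: str) -> Order:
        return self._orders[reference]


def test_add_line_through_the_service_layer():
    repo = FakeRepository()
    repo.add(Order("order-001"))
    add_line("order-001", "RED-CHAIR", repo)
    assert "RED-CHAIR" in repo.get("order-001").lines
```

The test never reaches into the domain object's "private" attributes; it exercises the use case through the service function, which is the lower-coupling style of test described above.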
161 | # Ch6. Unit of Work pattern
162 | - Abstraction over the idea of *atomic operations*
163 | - Allows us to fully decouple our service layer from the data layer
164 | - We can implement it using Python's **context manager**
165 | - Context manager: `__enter__` and `__exit__` are the two magic methods that execute when we enter the `with` block and when we exit it, respectively = setup and teardown phases
166 | 
167 | > **Don't mock what you don't own**: rule of thumb that forces us to build simple abstractions over messy subsystems
168 | 
169 | # Ch7. Aggregates and consistency boundaries
170 | ## Why not just run everything in a spreadsheet?
171 | **"CSV over SMTP"** architecture:
172 | - Low initial complexity
173 | - Does not scale very well
174 | - Difficult to apply logic and maintain consistency
175 | 
176 | ## Invariants, constraints and consistency
177 | - **Constraint**: rule that restricts the possible states our model can get into
178 | - **Invariant**: defined a little more precisely as a condition that is always true
179 | - **Locks**: prevent two operations from happening simultaneously on the same row or same table
180 | 
181 | ## What is an Aggregate?
182 | - A domain object that contains other domain objects and lets us treat the whole collection as a single unit
183 | - The only way to modify the objects inside the aggregate is to load the whole thing, and to call methods on the aggregate itself
184 | 
185 | > "An Aggregate is a cluster of associated objects that we treat as a unit for the purpose of data changes" - Eric Evans, DDD blue book
186 | 
187 | ## Choosing an Aggregate
188 | - Draw a boundary around a small number of objects (the smaller, the better for performance)
189 | - The objects inside have to be consistent with one another
190 | - Give it a good name
191 | 
192 | > **Bounded contexts**: reaction against attempts to capture entire businesses into a single model
193 | 
194 | ## One Aggregate = One Repository
195 | - **Repositories should only return aggregates**
196 | - Aggregates are the only entities that are publicly accessible to the outside world
197 | 
198 | ## Optimistic concurrency
199 | - Version numbers are one way to implement it
200 | - Another implementation: setting the Postgres transaction isolation level to SERIALIZABLE (severe performance cost)
201 | - **Optimistic** = default assumption is that everything will be fine when two users want to make changes to the database -> you need to explicitly handle the possibility of failures in the case of a clash (retry the failed operation from the beginning)
202 | - **Pessimistic** concurrency = assumption that two users are going to cause conflicts -> prevent conflicts in all cases -> lock everything just to be safe -> you don't need to think about handling failures because the database will prevent them for you (you do need to think about deadlocks)
203 | 
204 | # Part II. Event-Driven Architecture
--------------------------------------------------------------------------------
/clean-code/clean_code.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/clean-code/clean_code.jpeg
--------------------------------------------------------------------------------
/clean-code/notes.md:
--------------------------------------------------------------------------------
1 | # Clean Code: A Handbook of Agile Software Craftsmanship
2 | 
3 | Author: Robert C.
Martin 4 | 5 | [Available here](https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882) 6 | 7 | ![clean_code](clean_code.jpeg) 8 | 9 | - [Clean Code: A Handbook of Agile Software Craftsmanship](#clean-code-a-handbook-of-agile-software-craftsmanship) 10 | - [Ch1. Clean Code](#ch1-clean-code) 11 | - [Reading vs. Writing](#reading-vs-writing) 12 | - [Ch2. Meaningful Names](#ch2-meaningful-names) 13 | - [Use intention-revealing names](#use-intention-revealing-names) 14 | - [Avoid disinformation](#avoid-disinformation) 15 | - [Make meaningful distinctions](#make-meaningful-distinctions) 16 | - [Use pronounceable names](#use-pronounceable-names) 17 | - [Use searchable names](#use-searchable-names) 18 | - [Avoid encodings](#avoid-encodings) 19 | - [Avoid mental mapping](#avoid-mental-mapping) 20 | - [Class names](#class-names) 21 | - [Method names](#method-names) 22 | - [Don't be cute](#dont-be-cute) 23 | - [Pick one word per concept](#pick-one-word-per-concept) 24 | - [Don't pun](#dont-pun) 25 | - [Use solution domain names](#use-solution-domain-names) 26 | - [Use problem domain names](#use-problem-domain-names) 27 | - [Add meaningful context](#add-meaningful-context) 28 | - [Don't add gratuitous context](#dont-add-gratuitous-context) 29 | - [Ch3. Functions](#ch3-functions) 30 | - [Small](#small) 31 | - [Blocks and Indenting](#blocks-and-indenting) 32 | - [Do one thing](#do-one-thing) 33 | - [One level of abstraction per function](#one-level-of-abstraction-per-function) 34 | - [The Stepdown rule](#the-stepdown-rule) 35 | - [Use descriptive names](#use-descriptive-names) 36 | - [Function arguments](#function-arguments) 37 | - [How do you write functions like this?](#how-do-you-write-functions-like-this) 38 | - [Ch4. Comments](#ch4-comments) 39 | - [Ch5. Formatting](#ch5-formatting) 40 | - [The newspaper metaphor](#the-newspaper-metaphor) 41 | - [Vertical formatting](#vertical-formatting) 42 | - [Horizontal formatting](#horizontal-formatting) 43 | - [Ch6. Objects and Data Structures](#ch6-objects-and-data-structures) 44 | - [Data/Object anti-symmetry](#dataobject-anti-symmetry) 45 | - [Data transfer objects (DTO)](#data-transfer-objects-dto) 46 | - [Active records](#active-records) 47 | - [Objects](#objects) 48 | - [Data Structures](#data-structures) 49 | - [Ch7. Error Handling](#ch7-error-handling) 50 | - [Write your `Try-Catch-Finally` statement first](#write-your-try-catch-finally-statement-first) 51 | - [Provide context with exceptions](#provide-context-with-exceptions) 52 | - [Define the normal flow](#define-the-normal-flow) 53 | - [Ch8. Boundaries](#ch8-boundaries) 54 | - [Ch9. Unit Tests](#ch9-unit-tests) 55 | - [The three laws of TDD](#the-three-laws-of-tdd) 56 | - [Keeping tests clean](#keeping-tests-clean) 57 | - [Clean tests](#clean-tests) 58 | - [F.I.R.S.T.](#first) 59 | - [Ch10. Classes](#ch10-classes) 60 | - [The Single Responsibility Principle](#the-single-responsibility-principle) 61 | - [Cohesion](#cohesion) 62 | - [Organizing for change](#organizing-for-change) 63 | - [Ch11. Systems](#ch11-systems) 64 | - [Separate constructing a system from using it](#separate-constructing-a-system-from-using-it) 65 | - [Separation of main](#separation-of-main) 66 | - [Factories](#factories) 67 | - [Dependency injection (DI)](#dependency-injection-di) 68 | - [Scaling up](#scaling-up) 69 | - [Test drive the system architecture](#test-drive-the-system-architecture) 70 | - [Optimize decision making](#optimize-decision-making) 71 | - [Ch12. 
Emergence](#ch12-emergence) 72 | - [Simple design rule 1: runs all the tests](#simple-design-rule-1-runs-all-the-tests) 73 | - [Simple design rule 2-4: refactoring](#simple-design-rule-2-4-refactoring) 74 | - [No duplication](#no-duplication) 75 | - [Expressive](#expressive) 76 | - [Minimal classes and methods](#minimal-classes-and-methods) 77 | - [Ch13. Concurrency](#ch13-concurrency) 78 | - [Why concurrency?](#why-concurrency) 79 | - [Myths and misconceptions](#myths-and-misconceptions) 80 | - [Concurrency defense principles](#concurrency-defense-principles) 81 | - [Know your execution models](#know-your-execution-models) 82 | - [Others](#others) 83 | 84 | # Ch1. Clean Code 85 | > "Most managers want good code, even when they are obsessing about the schedule (...) It's *your* job to defend the code with equal passion" 86 | 87 | - Clean code is *focused*: each function, each class, each module exposes a single-minded attitude that remains entirely undistracted, and upolluted, by the surrounding details 88 | - Code, without tests, is not clean. No matter how elegant it is, no matter how readable and accessible, if it hath not tests, it be unclean 89 | - You will read it, and it will be pretty much what you expected. It will be obvious, simple, and compelling 90 | 91 | ## Reading vs. Writing 92 | - The ratio of time spent reading vs. writing is well over 10:1 93 | - We are constantly reading old code as part of the effort to write new code 94 | - **We want the reading of code to be easy, even if it makes the writing harder** 95 | - You cannot write code if you cannot read the surrounding code 96 | - If you want to go fast, get done quickly, if you want your code to be easy to write, make it easy to read 97 | 98 | # Ch2. Meaningful Names 99 | 100 | ## Use intention-revealing names 101 | > Choosing good names takes time, but saves more than it takes. Take care with your names and change them when you find better ones 102 | 103 | ## Avoid disinformation 104 | - Avoid leaving false clues that obscure the meaning of code 105 | - Avoid words whose entrenched meanings vary from our intended meaning 106 | 107 | ## Make meaningful distinctions 108 | If names must be different, then they should also mean something different 109 | 110 | ## Use pronounceable names 111 | - Humans are good at words 112 | - Words are, by definition, pronounceable 113 | 114 | ## Use searchable names 115 | Single-letter names and numeric constants have a particular problem in that they are not easy to locate across a body of text 116 | 117 | ## Avoid encodings 118 | Encoding type or scope information into names simply adds an extra burden of deciphering 119 | 120 | ## Avoid mental mapping 121 | > Clarity is king 122 | 123 | ## Class names 124 | - Classes and objects should have noun or noun phrase names 125 | - A class name should not be a verb 126 | 127 | ## Method names 128 | Methods should have verb or verb phrase names 129 | 130 | ## Don't be cute 131 | - Choose clarity over entertainment value 132 | - Say what you mean. 
Mean what you say
133 | 
134 | ## Pick one word per concept
135 | A consistent lexicon is a great boon to the programmers who must use your code
136 | 
137 | ## Don't pun
138 | Avoid using the same word for two purposes -> essentially a pun
139 | 
140 | ## Use solution domain names
141 | - People who read your code will be programmers
142 | - Use CS terms, algorithm names, pattern names, math terms
143 | 
144 | ## Use problem domain names
145 | - Separate solution and problem domain concepts
146 | - Code that has more to do with problem domain concepts should have names drawn from the problem domain
147 | 
148 | ## Add meaningful context
149 | Most names are not meaningful in and of themselves
150 | 
151 | ## Don't add gratuitous context
152 | - Shorter names are generally better than long ones, so long as they are clear
153 | - Add no more context to a name than is necessary
154 | 
155 | > Choosing good names requires good descriptive skills and a shared cultural background. This is a teaching issue rather than a technical, business, or management issue
156 | 
157 | # Ch3. Functions
158 | ## Small
159 | Functions should be small
160 | 
161 | ### Blocks and Indenting
162 | - Blocks within `if` statements, `else` statements, `while` statements should be one line long -> probably a function call
163 | - Keeping the enclosing function small adds documentary value
164 | - Functions should not be large enough to hold nested structures -> makes them easier to read and understand
165 | 
166 | ## Do one thing
167 | > Functions should do one thing. They should do it well. They should do it only
168 | 
169 | - Reasons to write functions: decompose a larger concept (the name of the function) into a set of steps at the next level of abstraction
170 | - Functions that do one thing cannot be divided into sections
171 | 
172 | ## One level of abstraction per function
173 | - Once details are mixed with essential concepts, more details tend to accrete within the function
174 | 
175 | ### The Stepdown rule
176 | - We want code to be read like a top-down narrative
177 | - A set of TO paragraphs, each describing the current level of abstraction and referencing subsequent TO paragraphs at the next level down
178 | 
179 | ## Use descriptive names
180 | Ward's principle: *"You know you are working on clean code when each routine turns out to be pretty much what you expected"*
181 | 
182 | - Spend time choosing a name
183 | - You should try several different names and read the code with each in place
184 | 
185 | ## Function arguments
186 | Ideal number of arguments for a function:
187 | - zero (niladic)
188 | - one (monadic)
189 | - two (dyadic)
190 | - more than that should be avoided where possible
191 | 
192 | - **Arguments are hard from a testing point of view** -> test cases for all combinations of arguments
193 | - Output arguments are harder to understand than input arguments
194 | - **Passing a boolean into a function (flag arguments) is a terrible practice** -> loudly proclaims that this function does more than one thing -> it does one thing if the flag is true and another if the flag is false! (see the sketch after this list)
195 | - When a function seems to need more than two or three arguments, it is likely that some of those arguments ought to be wrapped into a class of their own -> when groups of variables are passed together, they are likely part of a concept that deserves a name of its own
196 | - Side effects are lies -> your function promises to do one thing, but it also does other *hidden* things
197 | - Either your function should change the state of an object, or it should return some information about the object
198 | - **Prefer exceptions to returning error codes**
199 | - Extract try/catch blocks into functions of their own
200 | - Functions should do one thing -> error handling is one thing
201 | - Don't repeat yourself -> duplication may be the root of all evil in software
202 | 
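The flag-argument point is easiest to see in code. Below is a small, hypothetical Python sketch loosely mirroring the book's `render(isSuite)` illustration (the book's code is in Java; the names here are made up for this example):

```python
# A flag argument makes one function do two things.
def render(page: str, is_suite: bool) -> str:
    if is_suite:
        return f"<suite>{page}</suite>"
    return f"<page>{page}</page>"


# Splitting it gives each caller a single-purpose, honestly named function.
def render_for_suite(page: str) -> str:
    return f"<suite>{page}</suite>"


def render_single_page(page: str) -> str:
    return f"<page>{page}</page>"
```

Each caller now states its intent at the call site instead of passing a boolean whose meaning has to be looked up.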
203 | ## How do you write functions like this?
204 | Writing software is like any other kind of writing
205 | 1. Get your thoughts down first
206 | 2. Massage it until it reads well
207 | 
208 | The first draft might be clumsy and disorganized, so you restructure it and refine it until it reads the way you want it to read
209 | 
210 | > Every system is built from a domain-specific language designed by the programmers to describe the system. Functions are the verbs of that language, and classes are the nouns.
211 | 
212 | # Ch4. Comments
213 | - Comments are always failures. We must have them because we cannot always figure out how to express ourselves without them, but their use is not a cause for celebration
214 | - Comments lie. Not always, and not intentionally, but too often
215 | - The older a comment is, and the farther away it is from the code it describes, the more likely it is to be wrong
216 | - **Truth can only be found in the code**
217 | - Explain your intent in code: **create a function that says the same thing as the comment you want to write**
218 | - A comment may be used to amplify the importance of something that may otherwise seem inconsequential
219 | - We have good source code control systems now. Those systems will remember the code for us. We don't have to comment it out any more. Just delete the code
220 | - Short functions don't need much description -> a well-chosen name for a small function that does one thing is better than a comment header
221 | 
222 | # Ch5. Formatting
223 | Code formatting:
224 | - Too important to ignore
225 | - Is about communication -> the developer's first order of business
226 | 
227 | > Small files are easier to understand than large files are
228 | 
229 | ## The newspaper metaphor
230 | A source file should be like a newspaper article
231 | - The name should be simple but explanatory
232 | - The name, by itself, should be sufficient to tell us whether we are in the right module or not
233 | 
234 | ## Vertical formatting
235 | - Avoid forcing the reader to hop around through the source files and classes
236 | - **Dependent functions**: if one function calls another, they should be vertically close, and the caller should be above the callee
237 | 
238 | ## Horizontal formatting
239 | - Strive to keep your lines short
240 | - Going beyond 100-120 characters isn't advisable
241 | 
242 | # Ch6. Objects and Data Structures
243 | 
244 | ## Data/Object anti-symmetry
245 | > Objects hide their data behind abstractions and expose functions that operate on that data. Data structures expose their data and have no meaningful functions
246 | 
247 | - Procedural code (code using data structures) makes it easy to add new functions without changing the existing data structures. OO code makes it easy to add new classes without changing existing functions
248 | - Procedural code makes it hard to add new data structures because all the functions must change. OO code makes it hard to add new functions because all the classes must change
249 | 
250 | > Mature programmers know that the idea that *everything is an object* is a myth. Sometimes you really do want simple data structures with procedures operating on them
251 | 
252 | ## Data transfer objects (DTO)
253 | DTO: quintessential form of a data structure -> a class with public variables and no functions
254 | 
255 | ### Active records
256 | - Special forms of DTOs
257 | - Data structures with public (or bean-accessed) variables; but they typically have navigational methods like `save` and `find`
258 | 
259 | ## Objects
260 | - expose behavior and hide data
261 | - easy to add new kinds of objects without changing existing behaviors
262 | - hard to add new behaviors to existing objects
263 | 
264 | ## Data Structures
265 | - expose data and have no significant behavior
266 | - easy to add new behaviors to existing data structures
267 | - hard to add new data structures to existing functions
268 | 
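To make the anti-symmetry concrete, here is a hypothetical Python sketch (the book's examples use Java shape classes; these names are illustrative). The procedural version makes adding a new function cheap and a new shape expensive; the object-oriented version inverts that trade-off.

```python
import math
from dataclasses import dataclass


# Procedural style: data structures expose data; functions operate on them.
@dataclass
class Square:
    side: float


@dataclass
class Circle:
    radius: float


def area(shape) -> float:
    # Adding perimeter() later is easy; adding Triangle means editing every function.
    if isinstance(shape, Square):
        return shape.side ** 2
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    raise TypeError(f"unknown shape: {shape!r}")


# OO style: objects hide data and expose behavior.
class OOSquare:
    def __init__(self, side: float) -> None:
        self._side = side

    def area(self) -> float:
        # Adding OOTriangle later is easy; adding perimeter() means editing every class.
        return self._side ** 2


class OOCircle:
    def __init__(self, radius: float) -> None:
        self._radius = radius

    def area(self) -> float:
        return math.pi * self._radius ** 2
```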
269 | # Ch7. Error Handling
270 | > Things can go wrong, and when they do, we as programmers are responsible for making sure that our code does what it needs to do
271 | 
272 | - Error handling is important, but if it obscures logic, it's wrong
273 | - It is better to throw an exception when you encounter an error. The calling code is cleaner. Its logic is not obscured by error handling
274 | 
275 | ## Write your `Try-Catch-Finally` statement first
276 | - `try` blocks are like transactions
277 | - Your `catch` has to leave your program in a consistent state, no matter what happens in the `try`
278 | - Try to write tests that force exceptions, and then add behavior to your handler to satisfy your tests -> this causes you to build the transaction scope of the `try` block first and helps maintain the transactional nature of that scope
279 | 
280 | ## Provide context with exceptions
281 | - Create informative error messages and pass them along with your exceptions
282 | - Mention the operation that failed and the type of failure
283 | - If you are logging in your application, pass along enough information to be able to log the error in your `catch`
284 | 
285 | > Wrapping third-party APIs is a best practice -> it minimizes your dependencies upon them: you can choose to move to a different library in the future without much penalty, and it makes it easier to mock out third-party calls when you are testing your own code
286 | 
287 | ## Define the normal flow
288 | **Special case pattern**: you create a class or configure an object so that it handles a special case for you -> the client code doesn't have to deal with exceptional behavior
289 | 
290 | # Ch8. Boundaries
291 | - It's not our job to test the third-party code, but it may be in our best interest to write tests for the third-party code we use
292 | - **Learning tests**: call the third-party API as we expect to use it in our application -> controlled experiments that check our understanding of that API
293 | - **Clean Boundaries**: code at the boundaries needs clear separation and tests that define expectations
294 | 
295 | > Avoid letting too much of our code know about the third-party particulars. It's better to depend on something you control than on something you don't control, lest it end up controlling you
296 | 
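As a quick illustration of the Special Case pattern from Ch7, here is a hedged Python sketch loosely modeled on the book's per-diem meal-expenses example (the book's version is in Java; the class names, the DAO shape, and the per-diem amount below are assumptions for illustration). The data-access object returns a special-case object instead of raising, so the client code follows the normal flow.

```python
from dataclasses import dataclass


@dataclass
class MealExpenses:
    total: float

    def get_total(self) -> float:
        return self.total


class PerDiemMealExpenses:
    """Special case returned when no expenses were recorded."""

    def get_total(self) -> float:
        return 25.0  # illustrative per-diem default


class ExpenseReportDAO:
    def __init__(self, records: dict) -> None:
        self._records = records

    def get_meals(self, employee_id: str):
        # The DAO, not every caller, decides what "missing" means.
        total = self._records.get(employee_id)
        return MealExpenses(total) if total is not None else PerDiemMealExpenses()


dao = ExpenseReportDAO({"emp-1": 42.0})
print(dao.get_meals("emp-2").get_total())  # 25.0 -- no try/except in the client code
```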
297 | # Ch9. Unit Tests
298 | ## The three laws of TDD
299 | - **First Law**: You may not write production code until you have written a failing unit test
300 | - **Second Law**: You may not write more of a unit test than is sufficient to fail, and not compiling is failing
301 | - **Third Law**: You may not write more production code than is sufficient to pass the current failing test
302 | 
303 | ## Keeping tests clean
304 | - Having dirty tests is equivalent to, if not worse than, having no tests
305 | - Tests must change as the production code evolves -> the dirtier the tests, the harder they are to change
306 | - If your tests are dirty, you begin to lose the ability to improve the structure of that code
307 | > Test code is just as important as production code. It requires thought, design, and care. It must be kept as clean as production code
308 | 
309 | ## Clean tests
310 | Readability is perhaps even more important in unit tests than it is in production code:
311 | - Clarity
312 | - Simplicity
313 | - Density of expression (say a lot with as few expressions as possible)
314 | 
315 | **BUILD-OPERATE-CHECK** pattern:
316 | - First part builds up the test data
317 | - Second part operates on that test data
318 | - Third part checks that the operation yielded the expected results
319 | 
320 | **Domain-Specific Testing Language**: a testing language (specialized API used by the tests) -> makes tests expressive and succinct -> makes the tests more convenient to write and easier to read
321 | 
322 | **given-when-then** convention: makes the tests even easier to read
323 | 
324 | **TEMPLATE METHOD** pattern -> putting the given/when parts in the base class, and the then parts in different derivatives
325 | 
326 | - The number of asserts in a test ought to be minimized
327 | - We want to test a single concept in each test function
328 | 
329 | ## F.I.R.S.T.
330 | - **Fast**: when tests run slow, you won't want to run them frequently
331 | - **Independent**: you should be able to run each test independently and run the tests in any order you like
332 | - **Repeatable**: if your tests aren't repeatable in any environment, then you'll always have an excuse for why they fail
333 | - **Self-Validating**: you should not have to read through a log file to tell whether the tests pass (should have a boolean output -> pass/fail)
334 | - **Timely**: unit tests should be written just before the production code that makes them pass
335 | 
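A minimal pytest-style sketch of BUILD-OPERATE-CHECK combined with the given-when-then convention (hypothetical class and test names; the book's examples are in Java):

```python
class ShoppingCart:
    def __init__(self) -> None:
        self._items = {}

    def add(self, sku: str, qty: int = 1) -> None:
        self._items[sku] = self._items.get(sku, 0) + qty

    def count(self) -> int:
        return sum(self._items.values())


def test_adding_an_item_increases_the_count():
    # Given (build): an empty cart
    cart = ShoppingCart()

    # When (operate): one item is added
    cart.add("BLUE-LAMP")

    # Then (check): a single concept, a minimal number of asserts
    assert cart.count() == 1
```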
336 | # Ch10. Classes
337 | - Smaller is the primary rule when it comes to designing classes
338 | - The name of a class should describe what responsibilities it fulfills
339 | - If we cannot derive a concise name for a class, then it's likely too large -> the more ambiguous the class name, the more likely it has too many responsibilities
340 | 
341 | ## The Single Responsibility Principle
342 | - **SRP** is one of the more important concepts in OO design
343 | - States that a class or module should have one, and only one, *reason to change*
344 | - Gives us a definition of responsibility
345 | - Gives us a guideline for class size
346 | - A system with many small classes has no more moving parts than a system with a few large classes
347 | 
348 | > Trying to identify responsibilities (reasons to change) often helps us recognize and create better abstractions in our code
349 | 
350 | ## Cohesion
351 | - Classes should have a small number of instance variables
352 | - Each of the methods of a class should manipulate one or more of those variables
353 | - A class in which each variable is used by each method is **maximally cohesive**
354 | - Maintaining cohesion results in many small classes
355 | 
356 | ## Organizing for change
357 | - Change is continual
358 | - Every change -> risk that the remainder of the system no longer works as intended
359 | - Clean system -> organize our classes to reduce the risk of change
360 | 
361 | > **Open-Closed Principle (OCP)**: another key OO class design principle -> classes should be open for extension but closed for modification
362 | 
363 | - Ideal system -> we incorporate new features by extending the system, not by making modifications to existing code
364 | 
365 | > **Dependency Inversion Principle (DIP)** -> classes should depend upon abstractions, not on concrete details
366 | 
367 | # Ch11. Systems
368 | ## Separate constructing a system from using it
369 | > Software systems should separate the startup process, when the application objects are constructed and the dependencies are "wired" together, from the runtime logic that takes over after startup
370 | 
371 | - Startup process: a *concern* that any application must address
372 | - *Separation of concerns*: one of the most important design techniques
373 | - Never let little, convenient idioms lead to modularity breakdown
374 | 
375 | ## Separation of main
376 | ### Factories
377 | - **ABSTRACT FACTORY** pattern -> gives the application control of *when* to build the object, but keeps the details of that construction separate from the application code
378 | 
379 | ### Dependency injection (DI)
380 | - Powerful mechanism for separating construction from use
381 | - Application of *Inversion of Control* (IoC) to dependency management
382 | - Moves secondary responsibilities from an object to other objects that are dedicated to the purpose (supporting SRP)
383 | - The invoking object doesn't control what kind of object is actually returned, but it still actively resolves the dependency
384 | > An object should not take responsibility for instantiating dependencies itself. Instead, it should pass this responsibility to another "authoritative" mechanism (inverting control).
Setup is a global concern, this authoritative mechanism will be either the "main" routine or a special-purpose container 385 | 386 | ## Scaling up 387 | - **Myth**: we can get systems "right the first time" 388 | - Implement only today's stories -> then refactor and expand the system to implement new stories tomorrow = essence of iterative and incremental agility 389 | - TDD, refactoring, and the clean code they produce make this work at the code level 390 | - Software systems are unique compared to physical systems. Their archiectures can grow incrementally, **if we maintain the proper separation of concerns** 391 | 392 | ## Test drive the system architecture 393 | - **Big Design Up Front (BDUF)**: harmful because it inhibits adapting to change, due to psychological resistance to discarding prior effort and because of the way architecture choices influence subsequent thinking about the design 394 | 395 | ## Optimize decision making 396 | - Modularity and separation of concerns make decentralized management and decision making possible 397 | - Give responsibilities to the most qualified persons 398 | - **It is best to postpone decisions until the last possible moment** -> lets us make informed choices with the best possible information. A premature decision is a decision made with suboptimal knowledge 399 | 400 | > Whether you are designing systems or individual modules, never forget to use **the simplest thing that can possibly work** 401 | 402 | # Ch12. Emergence 403 | A design is "simple", if it follows these rules: 404 | - Run all the tests 405 | - Contains no duplication 406 | - Expresses the intent of the programmer 407 | - Minimizes the number of classes and methods 408 | 409 | ## Simple design rule 1: runs all the tests 410 | - Systems that aren't testable aren't verifiable 411 | - A system that cannot be verified should never be deployed 412 | - Tight coupling makes it difficult to write tests 413 | - The more tests we write, the more we use principles like DIP and tools like dependency injection, interfaces, and abstraction to minimize coupling -> our designs improve even more 414 | - Primary OO goals -> low coupling and high cohesion 415 | 416 | ## Simple design rule 2-4: refactoring 417 | For each few lines of code we add, we pause and reflect on the new design 418 | 419 | ### No duplication 420 | - Duplication is the primary enemy of a well-designed system 421 | - It represents additional work, additional risk, and additional unnecessary complexity 422 | - TEMPLATE METHOD pattern: common technique for removing higher-level duplication 423 | 424 | ### Expressive 425 | > It's easy to write code that *we* understand, because at the time we write it we're deep in an understanding of the problem we're trying to solve. Other maintainers of the code aren't going to have so deep an understanding 426 | 427 | - Choose good names 428 | - Keep your functions and classes small 429 | - Use standard nomenclature 430 | - Tests primary goal = act as documentation by example 431 | - **The most important way to be expressive is to try. Care is a precious resource** 432 | 433 | ### Minimal classes and methods 434 | - Effort to make our classes and methods small -> we might create too many tiny classes and methods -> **also keep our function and class counts low!** 435 | 436 | > Although it's important to keep class and function count low, it's more important to have tests, eliminate duplication, and express yourself 437 | 438 | # Ch13. Concurrency 439 | > Objects are abstractions of processing. 
Threads are abstractions of schedule - James O. Coplien 440 | 441 | ## Why concurrency? 442 | - Concurrency is a decoupling strategy 443 | - Helps us decouple **what** gets done from **when** it gets done 444 | 445 | ## Myths and misconceptions 446 | - Concurrency can *sometimes* improve performance, but only when there is a lot of wait time that can be shared between multiple threads or multiple processors 447 | - The design of a concurrent algorithm can be remarkably different from the design of a single-threaded system 448 | - **Concurrency bugs aren't usually repeatable**, so they are often ignored as one-offs instead of the true defects they are 449 | - Concurrency often requires a fundamental change in design strategy 450 | 451 | ## Concurrency defense principles 452 | - **Single responsibility principle**: keep your concurrency-related code separate from other code 453 | - **Limit the scope of data**: data encapsulation; severely limit the access of any data that may be shared 454 | - **Use copies of data** 455 | - **Threads should be as independent as possible** 456 | 457 | ### Know your execution models 458 | - **Producer-Consumer** 459 | - **Readers-Writers** 460 | - **Dining Philosophers** 461 | 462 | ### Others 463 | - Keep synchronized sections small 464 | - Think about shut-down early and get it working early 465 | - Write tests that have the potential to expose problems and then run them frequently, with different programatic configurations and system configurations and load 466 | - Do not ignore system failures as one-offs 467 | - Do not try to chase down nonthreading bugs and threading bugs at the same time. Make sure your code works outside of threads 468 | - Make your thread-based code especially pluggable so that you can run it in various configurations 469 | - Run your threaded code on all target platforms early and often 470 | 471 | > Code that is simple to follow can become nightmarish when multiple threads and shared data get into the mix -> you need to write clean code with rigor or else face subtle and infrequent failures 472 | -------------------------------------------------------------------------------- /examples/fp_python.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Functional Programming with Python\n", 8 | "\n", 9 | "## Problem -> Parse csv\n", 10 | "input\n", 11 | "```\n", 12 | "csv = \"\"\"firstName;lastName\n", 13 | "Ada;Lovelace\n", 14 | "Emmy;Noether\n", 15 | "Marie;Curie\n", 16 | "Tu;Youyou\n", 17 | "Ada;Yonath\n", 18 | "Vera;Rubin\n", 19 | "Sally;Ride\"\"\"\n", 20 | "```\n", 21 | "\n", 22 | "output\n", 23 | "```\n", 24 | "target = [{'firstName': 'Ada', 'lastName': 'Lovelace'},\n", 25 | " {'firstName': 'Emmy', 'lastName': 'Noether'},\n", 26 | " ...]\n", 27 | "```\n", 28 | "\n", 29 | "[source](https://www.youtube.com/watch?v=r2eZ7lhqzNE)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 1, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "csv = \"\"\"firstName;lastName\n", 39 | "Ada;Lovelace\n", 40 | "Emmy;Noether\n", 41 | "Marie;Curie\n", 42 | "Tu;Youyou\n", 43 | "Ada;Yonath\n", 44 | "Vera;Rubin\n", 45 | "Sally;Ride\"\"\"" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "### Imperative Python" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 2, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | 
"lines = csv.split('\\n')\n", 62 | "matrix = [line.split(';') for line in lines]\n", 63 | "header = matrix.pop(0)\n", 64 | "records = []\n", 65 | "for row in matrix:\n", 66 | " record = {}\n", 67 | " for index, key in enumerate(header):\n", 68 | " record[key] = row[index]\n", 69 | " records.append(record)" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "data": { 79 | "text/plain": [ 80 | "[{'firstName': 'Ada', 'lastName': 'Lovelace'},\n", 81 | " {'firstName': 'Emmy', 'lastName': 'Noether'},\n", 82 | " {'firstName': 'Marie', 'lastName': 'Curie'},\n", 83 | " {'firstName': 'Tu', 'lastName': 'Youyou'},\n", 84 | " {'firstName': 'Ada', 'lastName': 'Yonath'},\n", 85 | " {'firstName': 'Vera', 'lastName': 'Rubin'},\n", 86 | " {'firstName': 'Sally', 'lastName': 'Ride'}]" 87 | ] 88 | }, 89 | "execution_count": 3, 90 | "metadata": {}, 91 | "output_type": "execute_result" 92 | } 93 | ], 94 | "source": [ 95 | "records" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "### Functional Python" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 4, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "from toolz.curried import compose, map\n", 112 | "from functools import partial\n", 113 | "from operator import methodcaller\n", 114 | "\n", 115 | "split = partial(methodcaller, 'split')\n", 116 | "split_lines = split('\\n')\n", 117 | "split_fields = split(';')\n", 118 | "dict_from_keys_vals = compose(dict, zip)\n", 119 | "csv_to_matrix = compose(map(split_fields), split_lines)\n", 120 | "\n", 121 | "matrix = csv_to_matrix(csv)\n", 122 | "keys = next(matrix)\n", 123 | "records = map(partial(dict_from_keys_vals, keys), matrix)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "" 135 | ] 136 | }, 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "records" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 6, 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "data": { 153 | "text/plain": [ 154 | "[{'firstName': 'Ada', 'lastName': 'Lovelace'},\n", 155 | " {'firstName': 'Emmy', 'lastName': 'Noether'},\n", 156 | " {'firstName': 'Marie', 'lastName': 'Curie'},\n", 157 | " {'firstName': 'Tu', 'lastName': 'Youyou'},\n", 158 | " {'firstName': 'Ada', 'lastName': 'Yonath'},\n", 159 | " {'firstName': 'Vera', 'lastName': 'Rubin'},\n", 160 | " {'firstName': 'Sally', 'lastName': 'Ride'}]" 161 | ] 162 | }, 163 | "execution_count": 6, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "list(records)" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": 30, 175 | "metadata": {}, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "{'firstName': ['firstName', 'lastName'], 'lastName': ['Ada', 'Lovelace']}" 181 | ] 182 | }, 183 | "execution_count": 30, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "from toolz.curried import pipe\n", 190 | "\n", 191 | "pipe(csv, csv_to_matrix, partial(dict_from_keys_vals, keys))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 15, 197 | "metadata": {}, 198 | "outputs": [ 199 | { 200 | "data": { 201 | "text/plain": [ 202 | 
"\u001b[1;31mSignature:\u001b[0m \u001b[0mpipe\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0mfuncs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 203 | "\u001b[1;31mDocstring:\u001b[0m\n", 204 | "Pipe a value through a sequence of functions\n", 205 | "\n", 206 | "I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``\n", 207 | "\n", 208 | "We think of the value as progressing through a pipe of several\n", 209 | "transformations, much like pipes in UNIX\n", 210 | "\n", 211 | "``$ cat data | f | g | h``\n", 212 | "\n", 213 | ">>> double = lambda i: 2 * i\n", 214 | ">>> pipe(3, double, str)\n", 215 | "'6'\n", 216 | "\n", 217 | "See Also:\n", 218 | " compose\n", 219 | " compose_left\n", 220 | " thread_first\n", 221 | " thread_last\n", 222 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\toolz\\functoolz.py\n", 223 | "\u001b[1;31mType:\u001b[0m function\n" 224 | ] 225 | }, 226 | "metadata": {}, 227 | "output_type": "display_data" 228 | } 229 | ], 230 | "source": [ 231 | "?pipe" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 12, 237 | "metadata": {}, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/plain": [ 242 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mpartial\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 243 | "\u001b[1;31mDocstring:\u001b[0m \n", 244 | "partial(func, *args, **keywords) - new function with partial application\n", 245 | "of the given arguments and keywords.\n", 246 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\functools.py\n", 247 | "\u001b[1;31mType:\u001b[0m type\n", 248 | "\u001b[1;31mSubclasses:\u001b[0m \n" 249 | ] 250 | }, 251 | "metadata": {}, 252 | "output_type": "display_data" 253 | } 254 | ], 255 | "source": [ 256 | "?partial" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 7, 262 | "metadata": {}, 263 | "outputs": [ 264 | { 265 | "data": { 266 | "text/plain": [ 267 | "\u001b[1;31mSignature:\u001b[0m \u001b[0mcompose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0mfuncs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 268 | "\u001b[1;31mDocstring:\u001b[0m\n", 269 | "Compose functions to operate in series.\n", 270 | "\n", 271 | "Returns a function that applies other functions in sequence.\n", 272 | "\n", 273 | "Functions are applied from right to left so that\n", 274 | "``compose(f, g, h)(x, y)`` is the same as ``f(g(h(x, y)))``.\n", 275 | "\n", 276 | "If no arguments are provided, the identity function (f(x) = x) is returned.\n", 277 | "\n", 278 | ">>> inc = lambda i: i + 1\n", 279 | ">>> compose(str, inc)(3)\n", 280 | "'4'\n", 281 | "\n", 282 | "See Also:\n", 283 | " compose_left\n", 284 | " pipe\n", 285 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\toolz\\functoolz.py\n", 286 | "\u001b[1;31mType:\u001b[0m function\n" 287 | ] 288 | }, 289 | "metadata": {}, 290 | "output_type": "display_data" 291 | } 292 | ], 293 | "source": [ 294 | "?compose" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 13, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "data": { 304 | 
"text/plain": [ 305 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mmethodcaller\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 306 | "\u001b[1;31mDocstring:\u001b[0m \n", 307 | "methodcaller(name, ...) --> methodcaller object\n", 308 | "\n", 309 | "Return a callable object that calls the given method on its operand.\n", 310 | "After f = methodcaller('name'), the call f(r) returns r.name().\n", 311 | "After g = methodcaller('name', 'date', foo=1), the call g(r) returns\n", 312 | "r.name('date', foo=1).\n", 313 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\operator.py\n", 314 | "\u001b[1;31mType:\u001b[0m type\n", 315 | "\u001b[1;31mSubclasses:\u001b[0m \n" 316 | ] 317 | }, 318 | "metadata": {}, 319 | "output_type": "display_data" 320 | } 321 | ], 322 | "source": [ 323 | "?methodcaller" 324 | ] 325 | } 326 | ], 327 | "metadata": { 328 | "kernelspec": { 329 | "display_name": "Python 3", 330 | "language": "python", 331 | "name": "python3" 332 | }, 333 | "language_info": { 334 | "codemirror_mode": { 335 | "name": "ipython", 336 | "version": 3 337 | }, 338 | "file_extension": ".py", 339 | "mimetype": "text/x-python", 340 | "name": "python", 341 | "nbconvert_exporter": "python", 342 | "pygments_lexer": "ipython3", 343 | "version": "3.7.4" 344 | } 345 | }, 346 | "nbformat": 4, 347 | "nbformat_minor": 4 348 | } 349 | -------------------------------------------------------------------------------- /examples/map_reduce.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Map and Reduce style of programming with Python" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 64, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "import dill as pickle\n", 17 | "from random import choice\n", 18 | "from functools import reduce\n", 19 | "from pathos.multiprocessing import ProcessingPool as Pool\n", 20 | "from toolz.sandbox.parallel import fold\n", 21 | "from itertools import starmap" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 61, 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "\u001b[1;31mDocstring:\u001b[0m\n", 33 | "reduce(function, sequence[, initial]) -> value\n", 34 | "\n", 35 | "Apply a function of two arguments cumulatively to the items of a sequence,\n", 36 | "from left to right, so as to reduce the sequence to a single value.\n", 37 | "For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates\n", 38 | "((((1+2)+3)+4)+5). 
If initial is present, it is placed before the items\n", 39 | "of the sequence in the calculation, and serves as a default when the\n", 40 | "sequence is empty.\n", 41 | "\u001b[1;31mType:\u001b[0m builtin_function_or_method\n" 42 | ] 43 | }, 44 | "metadata": {}, 45 | "output_type": "display_data" 46 | } 47 | ], 48 | "source": [ 49 | "?reduce" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 24, 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "name": "stdout", 59 | "output_type": "stream", 60 | "text": [ 61 | "249\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "xs = [10, 5, 1, 19, 11, 203]\n", 67 | "\n", 68 | "def my_add(acc, nxt):\n", 69 | " return acc + nxt\n", 70 | "\n", 71 | "print(reduce(my_add, xs, 0))" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 25, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "249\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "print(reduce(lambda acc, nxt: acc+nxt, xs, 0))" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 62, 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "data": { 98 | "text/plain": [ 99 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mPool\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 100 | "\u001b[1;31mDocstring:\u001b[0m \n", 101 | "Mapper that leverages python's multiprocessing.\n", 102 | " \n", 103 | "\u001b[1;31mInit docstring:\u001b[0m\n", 104 | "Important class members:\n", 105 | " nodes - number (and potentially description) of workers\n", 106 | " ncpus - number of worker processors\n", 107 | " servers - list of worker servers\n", 108 | " scheduler - the associated scheduler\n", 109 | " workdir - associated $WORKDIR for scratch calculations/files\n", 110 | "\n", 111 | "Other class members:\n", 112 | " scatter - True, if uses 'scatter-gather' (instead of 'worker-pool')\n", 113 | " source - False, if minimal use of TemporaryFiles is desired\n", 114 | " timeout - number of seconds to wait for return value from scheduler\n", 115 | " \n", 116 | "NOTE: if number of nodes is not given, will autodetect processors\n", 117 | " \n", 118 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\pathos\\multiprocessing.py\n", 119 | "\u001b[1;31mType:\u001b[0m type\n", 120 | "\u001b[1;31mSubclasses:\u001b[0m \n" 121 | ] 122 | }, 123 | "metadata": {}, 124 | "output_type": "display_data" 125 | } 126 | ], 127 | "source": [ 128 | "?Pool" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 28, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "name": "stdout", 138 | "output_type": "stream", 139 | "text": [ 140 | "6.15 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "%%timeit -r 1\n", 146 | "with Pool() as P: \n", 147 | " fold(my_add, range(1000000), map=P.imap)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 29, 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "name": "stdout", 157 | "output_type": "stream", 158 | "text": [ 159 | "129 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 10 loops each)\n" 160 | ] 161 | } 162 | ], 163 | "source": [ 164 | "%%timeit -r 1\n", 165 | "reduce(my_add, range(1000000))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "markdown", 170 | "metadata": {}, 171 | "source": [ 172 | "### Speeding up map and reduce\n", 173 | "> Using a parallel map can counterintuitively be slower than using a lazy map in map an reduce scenarios\n", 174 | "\n", 175 | "We can always use parallelization at the reduce level instead of at the map level" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": 30, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "def map_combination(left, right):\n", 185 | " return left + right\n", 186 | "\n", 187 | "\n", 188 | "def keep_if_even(acc, nxt):\n", 189 | " if nxt % 2 == 0:\n", 190 | " return acc + [nxt]\n", 191 | " else: \n", 192 | " return acc" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 63, 198 | "metadata": {}, 199 | "outputs": [ 200 | { 201 | "data": { 202 | "text/plain": [ 203 | "\u001b[1;31mSignature:\u001b[0m\n", 204 | "\u001b[0mfold\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m\n", 205 | "\u001b[0m \u001b[0mbinop\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 206 | "\u001b[0m \u001b[0mseq\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 207 | "\u001b[0m \u001b[0mdefault\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;34m'__no__default__'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 208 | "\u001b[0m \u001b[0mmap\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;33m<\u001b[0m\u001b[1;32mclass\u001b[0m \u001b[1;34m'map'\u001b[0m\u001b[1;33m>\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 209 | "\u001b[0m \u001b[0mchunksize\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;36m128\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 210 | "\u001b[0m \u001b[0mcombine\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\n", 211 | "\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 212 | "\u001b[1;31mDocstring:\u001b[0m\n", 213 | "Reduce without guarantee of ordered reduction.\n", 214 | "\n", 215 | "inputs:\n", 216 | "\n", 217 | "``binop`` - associative operator. The associative property allows us to\n", 218 | " leverage a parallel map to perform reductions in parallel.\n", 219 | "``seq`` - a sequence to be aggregated\n", 220 | "``default`` - an identity element like 0 for ``add`` or 1 for mul\n", 221 | "\n", 222 | "``map`` - an implementation of ``map``. This may be parallel and\n", 223 | " determines how work is distributed.\n", 224 | "``chunksize`` - Number of elements of ``seq`` that should be handled\n", 225 | " within a single function call\n", 226 | "``combine`` - Binary operator to combine two intermediate results.\n", 227 | " If ``binop`` is of type (total, item) -> total\n", 228 | " then ``combine`` is of type (total, total) -> total\n", 229 | " Defaults to ``binop`` for common case of operators like add\n", 230 | "\n", 231 | "Fold chunks up the collection into blocks of size ``chunksize`` and then\n", 232 | "feeds each of these to calls to ``reduce``. This work is distributed\n", 233 | "with a call to ``map``, gathered back and then refolded to finish the\n", 234 | "computation. In this way ``fold`` specifies only how to chunk up data but\n", 235 | "leaves the distribution of this work to an externally provided ``map``\n", 236 | "function. 
This function can be sequential or rely on multithreading,\n", 237 | "multiprocessing, or even distributed solutions.\n", 238 | "\n", 239 | "If ``map`` intends to serialize functions it should be prepared to accept\n", 240 | "and serialize lambdas. Note that the standard ``pickle`` module fails\n", 241 | "here.\n", 242 | "\n", 243 | "Example\n", 244 | "-------\n", 245 | "\n", 246 | ">>> # Provide a parallel map to accomplish a parallel sum\n", 247 | ">>> from operator import add\n", 248 | ">>> fold(add, [1, 2, 3, 4], chunksize=2, map=map)\n", 249 | "10\n", 250 | "\u001b[1;31mFile:\u001b[0m c:\\users\\fraga\\anaconda3\\lib\\site-packages\\toolz\\sandbox\\parallel.py\n", 251 | "\u001b[1;31mType:\u001b[0m function\n" 252 | ] 253 | }, 254 | "metadata": {}, 255 | "output_type": "display_data" 256 | } 257 | ], 258 | "source": [ 259 | "?fold" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 39, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "name": "stdout", 269 | "output_type": "stream", 270 | "text": [ 271 | "996 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 272 | ] 273 | } 274 | ], 275 | "source": [ 276 | "%%timeit -r 1\n", 277 | "with Pool() as P:\n", 278 | " fold(keep_if_even, range(100000), [],\n", 279 | " map=P.imap, combine=map_combination)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 22, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "4.49 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 292 | ] 293 | } 294 | ], 295 | "source": [ 296 | "%%timeit -r 1\n", 297 | "reduce(keep_if_even, range(100000), [])" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 56, 303 | "metadata": {}, 304 | "outputs": [], 305 | "source": [ 306 | "N = 100000\n", 307 | "P = Pool()\n", 308 | "xs = range(N)\n", 309 | "\n", 310 | "\n", 311 | "# Parallel summation\n", 312 | "def my_add(left, right):\n", 313 | " return left+right" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 57, 319 | "metadata": {}, 320 | "outputs": [ 321 | { 322 | "name": "stdout", 323 | "output_type": "stream", 324 | "text": [ 325 | "611 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 326 | ] 327 | } 328 | ], 329 | "source": [ 330 | "%%timeit -r 1\n", 331 | "fold(my_add, xs, map=P.imap)" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 58, 337 | "metadata": {}, 338 | "outputs": [], 339 | "source": [ 340 | "# Parallel filter\n", 341 | "def map_combination(left, right):\n", 342 | " return left + right\n", 343 | "\n", 344 | "def keep_if_even(acc, nxt):\n", 345 | " if nxt % 2 == 0:\n", 346 | " return acc + [nxt]\n", 347 | " else: \n", 348 | " return acc" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 59, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "981 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "%%timeit -r 1\n", 366 | "fold(keep_if_even, xs, [], map=P.imap, combine=map_combination)" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 43, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "name": "stdout", 376 | "output_type": "stream", 377 | "text": [ 378 | "{1: 16586, 2: 16732, 3: 16727, 4: 16651, 5: 16683, 6: 16621}\n" 379 | ] 380 | } 381 | ], 382 | "source": [ 383 | "#Parallel frequencies\n", 384 | "def combine_counts(left, right):\n", 385 | " unique_keys = set(left.keys()).union(set(right.keys()))\n", 386 | " return {k:left.get(k,0)+right.get(k,0) for k in unique_keys}\n", 387 | "\n", 388 | "def make_counts(acc, nxt):\n", 389 | " acc[nxt] = acc.get(nxt,0) + 1\n", 390 | " return acc\n", 391 | "\n", 392 | "xs = (choice([1,2,3,4,5,6]) for _ in range(N))" 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": 60, 398 | "metadata": {}, 399 | "outputs": [ 400 | { 401 | "name": "stdout", 402 | "output_type": "stream", 403 | "text": [ 404 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n" 405 | ] 406 | } 407 | ], 408 | "source": [ 409 | "%%timeit -r 1\n", 410 | "fold(make_counts, xs, {}, map=P.imap, combine=combine_counts)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 65, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "\u001b[1;31mInit signature:\u001b[0m \u001b[0mstarmap\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m/\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", 422 | "\u001b[1;31mDocstring:\u001b[0m \n", 423 | "starmap(function, sequence) --> starmap object\n", 424 | "\n", 425 | "Return an iterator whose values are returned from the function evaluated\n", 426 | "with an argument tuple taken from the given sequence.\n", 427 | "\u001b[1;31mType:\u001b[0m type\n", 428 | "\u001b[1;31mSubclasses:\u001b[0m \n" 429 | ] 430 | }, 431 | "metadata": {}, 432 | "output_type": "display_data" 433 | } 434 | ], 435 | "source": [ 436 | "?starmap" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 67, 442 | "metadata": {}, 443 | "outputs": [ 444 | { 445 | "name": "stdout", 446 | "output_type": "stream", 447 | "text": [ 448 | "[8, 3, 1, 19, 22]\n" 449 | ] 450 | } 451 | ], 452 | "source": [ 453 | "xs = [7, 3, 1, 19, 11]\n", 454 | "ys = [8, 1, -3, 14, 22]\n", 455 | "\n", 456 | "loop_maxes = [max(ys[i], x) for i, x in enumerate(xs)]\n", 457 | "map_maxes = list(starmap(max, zip(xs, ys)))\n", 458 | "\n", 459 | "print(loop_maxes)" 460 | ] 461 | }, 462 | { 463 | "cell_type": "code", 464 | "execution_count": 68, 465 | "metadata": {}, 466 | "outputs": [ 467 | { 468 | "name": "stdout", 469 | "output_type": "stream", 470 | "text": [ 471 | "[8, 3, 1, 19, 22]\n" 472 | ] 473 | } 474 | ], 475 | "source": [ 476 | "print(map_maxes)" 477 | ] 478 | } 479 | ], 480 | "metadata": { 481 | "kernelspec": { 482 | "display_name": "Python 3", 483 | "language": "python", 484 | "name": "python3" 485 | }, 486 | "language_info": { 487 | "codemirror_mode": { 488 | "name": "ipython", 489 | "version": 3 490 | }, 491 | "file_extension": ".py", 492 | "mimetype": "text/x-python", 493 | "name": "python", 494 | "nbconvert_exporter": "python", 495 | "pygments_lexer": "ipython3", 
496 | "version": "3.7.4" 497 | } 498 | }, 499 | "nbformat": 4, 500 | "nbformat_minor": 4 501 | } 502 | -------------------------------------------------------------------------------- /examples/parallel_benchmark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "3.7.7\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "from platform import python_version\n", 18 | "\n", 19 | "print(python_version())" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "'1.18.0'" 31 | ] 32 | }, 33 | "execution_count": 2, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "import fklearn\n", 40 | "\n", 41 | "fklearn.__version__" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "import numpy as np\n", 51 | "\n", 52 | "np.random.seed(42)" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 4, 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [ 61 | "from fklearn.data.datasets import make_tutorial_data\n", 62 | "from fklearn.preprocessing.splitting import space_time_split_dataset\n", 63 | "from concurrent.futures import ThreadPoolExecutor" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 5, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "TRAIN_START_DATE = '2015-01-01' \n", 73 | "TRAIN_END_DATE = '2015-03-01' \n", 74 | "HOLDOUT_END_DATE = '2015-04-01'\n", 75 | "\n", 76 | "split_fn = space_time_split_dataset(train_start_date=TRAIN_START_DATE,\n", 77 | " train_end_date=TRAIN_END_DATE,\n", 78 | " holdout_end_date=HOLDOUT_END_DATE,\n", 79 | " space_holdout_percentage=.5,\n", 80 | " split_seed=42, \n", 81 | " space_column=\"id\",\n", 82 | " time_column=\"date\")" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 6, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "def measure_time_split_fn(sample_size):\n", 92 | " result = %timeit -r 1 -o split_fn(data[:sample_size])\n", 93 | " return result" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 7, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "sample_sizes = [10000, 100000, 250000, 500000, 750000, 1000000]\n", 103 | "data = make_tutorial_data(sample_sizes[-1])" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 8, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "7.12 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)\n", 116 | "26.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 117 | "54.4 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 118 | "102 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 119 | "152 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n", 120 | "204 ms ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n" 121 | ] 122 | } 123 | ], 124 | "source": [ 125 | "old_function_times = []\n", 126 | "\n", 127 | "for sample_size in sample_sizes:\n", 128 | " time_old = %timeit -r 1 -o split_fn(data[:sample_size])\n", 129 | " old_function_times.append(time_old.best)" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": 9, 135 | "metadata": {}, 136 | "outputs": [ 137 | { 138 | "data": { 139 | "text/plain": [ 140 | "[0.007119678000000001,\n", 141 | " 0.026497800000000016,\n", 142 | " 0.05441616999999992,\n", 143 | " 0.10246849999999999,\n", 144 | " 0.15187872999999996,\n", 145 | " 0.2037241999999999]" 146 | ] 147 | }, 148 | "execution_count": 9, 149 | "metadata": {}, 150 | "output_type": "execute_result" 151 | } 152 | ], 153 | "source": [ 154 | "old_function_times" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 10, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "985 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 167 | "1.01 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)939 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 168 | "\n", 169 | "947 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 170 | "876 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 171 | "149 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)\n" 172 | ] 173 | } 174 | ], 175 | "source": [ 176 | "with ThreadPoolExecutor() as executor:\n", 177 | " result = executor.map(measure_time_split_fn, sample_sizes)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 11, 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "data": { 187 | "text/plain": [ 188 | "[,\n", 189 | " ,\n", 190 | " ,\n", 191 | " ,\n", 192 | " ,\n", 193 | " ]" 194 | ] 195 | }, 196 | "execution_count": 11, 197 | "metadata": {}, 198 | "output_type": "execute_result" 199 | } 200 | ], 201 | "source": [ 202 | "list(result)" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 12, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "import numpy as np\n", 212 | "from numpy import nan\n", 213 | "import pandas as pd\n", 214 | "\n", 215 | "\n", 216 | "def new_make_tutorial_data(n: int) -> pd.DataFrame:\n", 217 | " \"\"\"\n", 218 | " Generates fake data for a tutorial. 
There are 3 numerical features (\"num1\", \"num3\" and \"num3\")\n", 219 | " and tow categorical features (\"cat1\" and \"cat2\")\n", 220 | " sex, age and severity, the treatment is a binary variable, medication and the response\n", 221 | " days until recovery.\n", 222 | " Parameters\n", 223 | " ----------\n", 224 | " n : int\n", 225 | " The number of samples to generate\n", 226 | " Returns\n", 227 | " ----------\n", 228 | " df : pd.DataFrame\n", 229 | " A tutorial dataset\n", 230 | " \"\"\"\n", 231 | " np.random.seed(1111)\n", 232 | "\n", 233 | " dataset = pd.DataFrame({\n", 234 | " \"id\": list(map(lambda x: \"id%d\" % x, np.random.randint(0, 1000000, n))),\n", 235 | " \"date\": np.random.choice(pd.date_range(\"2015-01-01\", periods=100), n),\n", 236 | " \"feature1\": np.random.gamma(20, size=n),\n", 237 | " \"feature2\": np.random.normal(40, size=n),\n", 238 | " \"feature3\": np.random.choice([\"a\", \"b\", \"c\"], size=n)})\n", 239 | "\n", 240 | " dataset[\"target\"] = (dataset[\"feature1\"]\n", 241 | " + dataset[\"feature2\"]\n", 242 | " + dataset[\"feature3\"].apply(lambda x: 0 if x == \"a\" else 30 if x == \"b\" else 10)\n", 243 | " + np.random.normal(0, 5, size=n))\n", 244 | "\n", 245 | " # insert some NANs\n", 246 | " dataset.loc[np.random.randint(0, n, 100), \"feature1\"] = nan\n", 247 | " dataset.loc[np.random.randint(0, n, 100), \"feature3\"] = nan\n", 248 | "\n", 249 | " return dataset" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 13, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "# sample_sizes = [10000, 50000, 100000, 200000, 250000,]\n", 259 | "sample_sizes = [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]\n", 260 | "data = new_make_tutorial_data(sample_sizes[-1])" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 14, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "name": "stdout", 270 | "output_type": "stream", 271 | "text": [ 272 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 273 | "2.45 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 274 | "2.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 275 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 276 | "2.48 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 277 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n", 278 | "2.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)2.48 s ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n", 279 | "\n" 280 | ] 281 | } 282 | ], 283 | "source": [ 284 | "with ThreadPoolExecutor() as executor:\n", 285 | " result = executor.map(measure_time_split_fn, sample_sizes)" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [] 294 | } 295 | ], 296 | "metadata": { 297 | "kernelspec": { 298 | "display_name": "Python 3", 299 | "language": "python", 300 | "name": "python3" 301 | }, 302 | "language_info": { 303 | "codemirror_mode": { 304 | "name": "ipython", 305 | "version": 3 306 | }, 307 | "file_extension": ".py", 308 | "mimetype": "text/x-python", 309 | "name": "python", 310 | "nbconvert_exporter": "python", 311 | "pygments_lexer": "ipython3", 312 | "version": "3.7.7" 313 | } 314 | }, 315 | "nbformat": 4, 316 | "nbformat_minor": 4 317 | } 318 | -------------------------------------------------------------------------------- /examples/parallel_example.py: -------------------------------------------------------------------------------- 1 | import collections 2 | from concurrent.futures import ThreadPoolExecutor 3 | import os 4 | import time 5 | from pprint import pprint 6 | 7 | 8 | def get_age(scientist): 9 | time.sleep(1) 10 | print(f"Process {os.getpid()} working record {scientist.name}") 11 | name_and_age = { 12 | "name": scientist.name, 13 | "age": int(time.strftime("%Y")) - scientist.born, 14 | } 15 | print(f"Process {os.getpid()} done processing record {scientist.name}") 16 | 17 | return name_and_age 18 | 19 | 20 | if __name__ == "__main__": 21 | Scientist = collections.namedtuple("Scientist", ["name", "field", "born", "nobel",]) 22 | 23 | scientists = ( 24 | Scientist(name="Ada Lovelace", field="math", born=1815, nobel=False), 25 | Scientist(name="Emmy Noether", field="math", born=1882, nobel=False), 26 | Scientist(name="Marie Curie", field="physics", born=1867, nobel=True), 27 | Scientist(name="Tu Youyou", field="chemistry", born=1930, nobel=True), 28 | Scientist(name="Ada Yonath", field="chemistry", born=1939, nobel=True), 29 | Scientist(name="Vera Rubin", field="astronomy", born=1928, nobel=False), 30 | Scientist(name="Sally Ride", field="physics", born=1951, nobel=False), 31 | ) 32 | 33 | print("\nSerial execution") 34 | start = time.time() 35 | result = tuple(map(get_age, scientists)) 36 | end = time.time() 37 | pprint(result) 38 | print(f"\nTime to complete: {end - start:.2f}s\n") 39 | 40 | 41 | print("\nParallel execution") 42 | start = time.time() 43 | with ThreadPoolExecutor() as executor: 44 | result = executor.map(get_age, scientists) 45 | end = time.time() 46 | pprint(tuple(result)) 47 | print(f"\nTime to complete: {end - start:.2f}s\n") 48 | 49 | -------------------------------------------------------------------------------- /high-performance-python/compiler_options.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/high-performance-python/compiler_options.png -------------------------------------------------------------------------------- /high-performance-python/cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/high-performance-python/cover.jpg -------------------------------------------------------------------------------- /high-performance-python/notes.md: 
-------------------------------------------------------------------------------- 1 | # High performance Python: Practical Performant Programming for Humans 2 | 3 | Authors: Micha Gorelick, Ian Ozsvald 4 | 5 | ![cover](cover.jpg) 6 | 7 | - [High performance Python: Practical Performant Programming for Humans](#high-performance-python-practical-performant-programming-for-humans) 8 | - [Ch1. Understanding Performant Python](#ch1-understanding-performant-python) 9 | - [Why use Python?](#why-use-python) 10 | - [How to be a highly performant programmer](#how-to-be-a-highly-performant-programmer) 11 | - [Ch2. Profiling to Find Bottlenecks](#ch2-profiling-to-find-bottlenecks) 12 | - [cProfile module](#cprofile-module) 13 | - [Visualizing cProfile output with Snakeviz](#visualizing-cprofile-output-with-snakeviz) 14 | - [Using line_profiler for line-by-line measurements](#using-line_profiler-for-line-by-line-measurements) 15 | - [Using memory_profiler to diagnose memory usage](#using-memory_profiler-to-diagnose-memory-usage) 16 | - [Introspecting an existing process with PySpy](#introspecting-an-existing-process-with-pyspy) 17 | - [Ch3. Lists and Tuples](#ch3-lists-and-tuples) 18 | - [Ch4. Dictionaries and Sets](#ch4-dictionaries-and-sets) 19 | - [Complexity and speed](#complexity-and-speed) 20 | - [How do dictionaries and sets work?](#how-do-dictionaries-and-sets-work) 21 | - [Ch5. Iterators and Generators](#ch5-iterators-and-generators) 22 | - [Python `for` loop deconstructed](#python-for-loop-deconstructed) 23 | - [Lazy generator evaluation](#lazy-generator-evaluation) 24 | - [Ch6. Matrix and Vector Computation](#ch6-matrix-and-vector-computation) 25 | - [Memory fragmentation](#memory-fragmentation) 26 | - [numpy](#numpy) 27 | - [numexpr: making in-place operations faster and easier](#numexpr-making-in-place-operations-faster-and-easier) 28 | - [Lessons from matrix optimizations](#lessons-from-matrix-optimizations) 29 | - [Pandas](#pandas) 30 | - [Pandas's internal model](#pandass-internal-model) 31 | - [Building DataFrames and Series from partial results rather than concatenating](#building-dataframes-and-series-from-partial-results-rather-than-concatenating) 32 | - [Advice for effective pandas development](#advice-for-effective-pandas-development) 33 | - [Ch7. Compiling to C](#ch7-compiling-to-c) 34 | - [Python offers](#python-offers) 35 | - [What sort of speed gains are possible?](#what-sort-of-speed-gains-are-possible) 36 | - [JIT versus AOT compilers](#jit-versus-aot-compilers) 37 | - [Why does type information help the code run faster?](#why-does-type-information-help-the-code-run-faster) 38 | - [Using a C compiler](#using-a-c-compiler) 39 | - [Cython](#cython) 40 | - [Numba](#numba) 41 | - [PyPy](#pypy) 42 | - [When to use each technology](#when-to-use-each-technology) 43 | - [Other upcoming projects](#other-upcoming-projects) 44 | - [Graphics Processing Units (GPUs)](#graphics-processing-units-gpus) 45 | - [Dynamic graphs: PyTorch](#dynamic-graphs-pytorch) 46 | - [Basic GPU profiling](#basic-gpu-profiling) 47 | - [When to use GPUs](#when-to-use-gpus) 48 | - [Ch8. Asynchronous I/O](#ch8-asynchronous-io) 49 | - [Introduction to asynchronous programming](#introduction-to-asynchronous-programming) 50 | - [How does async/await work?](#how-does-asyncawait-work) 51 | - [Gevent](#gevent) 52 | - [tornado](#tornado) 53 | - [aiohttp](#aiohttp) 54 | - [Batched results](#batched-results) 55 | - [Ch9. 
The multiprocessing module](#ch9-the-multiprocessing-module) 56 | - [Replacing multiprocessing with Joblib](#replacing-multiprocessing-with-joblib) 57 | - [Intelligent caching of function call results](#intelligent-caching-of-function-call-results) 58 | - [Using numpy](#using-numpy) 59 | - [Asynchronous systems](#asynchronous-systems) 60 | - [Interprocess Communication (IPC)](#interprocess-communication-ipc) 61 | - [multiprocessing.Manager()](#multiprocessingmanager) 62 | - [Redis](#redis) 63 | - [mmap](#mmap) 64 | - [Ch10. Clusters and Job Queues](#ch10-clusters-and-job-queues) 65 | - [Benefits of clustering](#benefits-of-clustering) 66 | - [Drawbacks of clustering](#drawbacks-of-clustering) 67 | - [Parallel Pandas with Dask](#parallel-pandas-with-dask) 68 | - [Dask](#dask) 69 | - [Swifter](#swifter) 70 | - [Vaex](#vaex) 71 | - [NSQ for robust production clustering](#nsq-for-robust-production-clustering) 72 | - [Ch11. Using less RAM](#ch11-using-less-ram) 73 | - [Objects for primitives are expensive](#objects-for-primitives-are-expensive) 74 | - [The `array` module stores many primitive objects cheaply](#the-array-module-stores-many-primitive-objects-cheaply) 75 | - [Using less RAM in NumPy with NumExpr](#using-less-ram-in-numpy-with-numexpr) 76 | - [Bytes versus Unicode](#bytes-versus-unicode) 77 | - [More efficient tree structures to represent strings](#more-efficient-tree-structures-to-represent-strings) 78 | - [Directed Acyclic Word Graph (DAWG)](#directed-acyclic-word-graph-dawg) 79 | - [Marisa Trie](#marisa-trie) 80 | - [Scikit-learn's DictVectorizer and FeatureHasher](#scikit-learns-dictvectorizer-and-featurehasher) 81 | - [SciPy's Sparse Matrices](#scipys-sparse-matrices) 82 | - [Tips for using less RAM](#tips-for-using-less-ram) 83 | - [Probabilistic Data Structures](#probabilistic-data-structures) 84 | - [Morris counter](#morris-counter) 85 | - [K-Minimum values](#k-minimum-values) 86 | - [Bloom filters](#bloom-filters) 87 | - [Ch12. Lessons from the field](#ch12-lessons-from-the-field) 88 | 89 | > "Every programmer can benefit from understanding how to build performant systems (...) When something becomes ten times cheaper in time or compute costs, suddenly the set of applications you can address is wider than you imagined" 90 | 91 | Supplemental material for the book (code examples, exercises, etc.) is available for download at https://github.com/mynameisfiber/high_performance_python_2e. 92 | 93 | # Ch1. Understanding Performant Python 94 | 95 | ## Why use Python? 96 | - highly expressive and easy to learn 97 | - `scikit-learn` wraps LIBLINEAR and LIBSVM (written in C) 98 | - `numpy` includes BLAS and other C and Fortran libraries 99 | - Python code that properly utilizes these modules can be as fast as comparable C code 100 | - "batteries included" 101 | - enables fast prototyping of an idea 102 | 103 | ## How to be a highly performant programmer 104 | Overall team velocity is far more important than speedups and complicated solutions. Several factors are key to this: 105 | - Good structure 106 | - Documentation 107 | - Debuggability 108 | - Shared standards 109 | 110 | # Ch2. Profiling to Find Bottlenecks 111 | 112 | Profiling lets you make the most pragmatic decisions for the least overall effort: code that runs "fast enough" and "lean enough" 113 | 114 | > "If you avoid profiling and jump to optimization, you'll quite likely do more work in the long run.
Always be driven by the results of profiling" 115 | 116 | *"Embarrassingly parallel problem"*: no data is shared between points 117 | 118 | `timeit` module temporarily disables the garbage collector 119 | 120 | ## cProfile module 121 | Built-in profiling tool in the standard library 122 | 123 | - `profile`: original and slower pure Python profiler 124 | - `cProfile`: same interface as `profile` and is written in `C` for a lower overhead 125 | 126 | 1. Generate a *hypothesis* about the speed of parts of your code 127 | 2. Measure how wrong you are 128 | 3. Improve your intuition about certain coding styles 129 | 130 | ### Visualizing cProfile output with Snakeviz 131 | `snakeviz`: visualizer that draws the output of `cProfile` as a diagram -> larger boxes are areas of code that take longer to run 132 | 133 | ## Using line_profiler for line-by-line measurements 134 | `line_profiler`: strongest tool for identifying the cause of CPU-bound problems in Python code: profiles individual functions on a line-by-line basis 135 | 136 | Be aware of the complexity of **Python's dynamic machinery** 137 | 138 | The order of evaluation for Python statements is both **left to right and opportunistic**: put the cheapest test on the left side of the equation 139 | 140 | ## Using memory_profiler to diagnose memory usage 141 | `memory_profiler` measures memory usage on a line-by-line basis: 142 | - Could we use less RAM by rewriting this function to work more efficiently? 143 | - Could we use more RAM and save CPU cycles by caching? 144 | 145 | **Tips** 146 | - Memory profiling makes your code run 10-100x slower 147 | - Install `psutil` so `memory_profiler` runs faster 148 | - Use `memory_profiler` occasionally and `line_profiler` more frequently 149 | - `--pdb-mmem=XXX` flag: the `pdb` debugger is activated after the process exceeds XXX MB -> drops you in directly at the point in your code where too many allocations are occurring 150 | 151 | ## Introspecting an existing process with PySpy 152 | `py-spy`: sampling profiler that doesn't require any code changes -> it introspects an already-running Python process and reports in the console with a *top-like* display 153 |
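The profiling workflow above can be tied together with a small, self-contained sketch (the function and numbers below are hypothetical, used only for illustration): profile a CPU-bound toy function with `cProfile`, then rank the most expensive calls with `pstats`; saving the stats with `python -m cProfile -o profile.stats script.py` and running `snakeviz profile.stats` gives the diagram view mentioned above.

```python
# Illustrative sketch only -- slow_sum is a made-up CPU-bound function
import cProfile
import pstats


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(2_000_000)
    profiler.disable()

    # Sort by cumulative time and print the five most expensive calls
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(5)
```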
154 | # Ch3. Lists and Tuples 155 | - **Lists**: dynamic arrays; mutable and allow for resizing 156 | - **Tuples**: static arrays; immutable and the data within them cannot be changed after they have been created 157 | - Tuples are cached by the Python runtime, which means that we don't need to talk to the kernel to reserve memory every time we want to use one 158 | 159 | Python lists have a built-in sorting algorithm that uses Timsort -> O(n) in the best case and O(n log n) in the worst case 160 | 161 | Once sorted, we can find our desired element using a binary search -> average complexity of O(log n) 162 | 163 | Dictionary lookup takes only O(1), but: 164 | - converting the data to a dictionary takes O(n) 165 | - no repeating keys may be undesirable 166 | 167 | `bisect` module: provides alternative functions, heavily optimized 168 | 169 | > "**Pick the right data structure and stick with it!** Although there may be more efficient data structures for particular operations, the cost of converting to those data structures may negate any efficiency boost" 170 | 171 | - Tuples are for describing multiple properties of one unchanging thing 172 | - Lists can be used to store collections of data about completely disparate objects 173 | - Both can take mixed types 174 | 175 | > "Generic code will be much slower than code specifically designed to solve a particular problem" 176 | 177 | - Tuple (immutable): lightweight data structure 178 | - List (mutable): extra memory needed to store them and extra computations needed when using them 179 | 180 | # Ch4. Dictionaries and Sets 181 | Ideal data structures to use when your data has no intrinsic order (except for insertion order), but does have a unique object that can be used to reference it 182 | - *key*: reference object 183 | - *value*: data 184 | 185 | Sets do not actually contain values: a set is a collection of unique keys -> useful for doing set operations 186 | 187 | **hashable** type: implements `__hash__` and either `__eq__` or `__cmp__` 188 | 189 | ## Complexity and speed 190 | - O(1) lookups based on the arbitrary index 191 | - O(1) insertion time 192 | - Larger footprint in memory 193 | - Actual speed depends on the hashing function 194 | 195 | ## How do dictionaries and sets work? 196 | Use *hash tables* to achieve O(1) lookups and insertions -> clever usage of a hash function to turn an arbitrary key (i.e., a string or object) into an index for a list 197 | 198 | > *load factor*: how well distributed the data is throughout the hash table -> related to the entropy of the hash function 199 | 200 | Hash functions must return integers 201 | 202 | - Numerical types (`int` and `float`): hash is based on the bit value of the number they represent 203 | - Tuples and strings: hash value based on their contents 204 | - Lists: do not support hashing because their values can change 205 | 206 | > A custom-selected hash function should be careful to evenly distribute hash values in order to avoid collisions (which will degrade the performance of a hash table) -> constantly "probe" the other values -> worst case O(n) = searching through a list 207 | 208 | **Entropy**: "how well distributed my hash function is" -> max entropy = *ideal* hash function = minimal number of collisions 209 | 210 | # Ch5.
Iterators and Generators 211 | 212 | ## Python `for` loop deconstructed 213 | ```python 214 | # The Python loop 215 | for i in object: 216 | do_work(i) 217 | 218 | # Is equivalent to 219 | object_iterator = iter(object) 220 | while True: 221 | try: 222 | i = next(object_iterator) 223 | except StopIteration: 224 | break 225 | else: 226 | do_work(i) 227 | ``` 228 | 229 | - Changing to generators instead of precomputed arrays may require algorithmic changes (sometimes not so easy to understand) 230 | 231 | > "Many of Python’s built-in functions that operate on sequences are generators themselves. `range` returns a generator of values as opposed to the actual list of numbers within the specified range. Similarly, `map`, `zip`, `filter`, `reversed`, and `enumerate` all perform the calculation as needed and don’t store the full result" 232 | 233 | - Generators have less memory impact than list comprehension 234 | - Generators are really a way of organizing your code and having smarter loops 235 | 236 | ## Lazy generator evaluation 237 | *Single pass* or *online* algorithms: at any point in our calculation with a generator, we have only the current value and cannot reference any other items in the sequence 238 | 239 | `itertools` from the standard library provides useful functions to make generators easier to use: 240 | - `islice`: slicing a potentially infinite generator 241 | - `chain`: chain together multiple generators 242 | - `takewhile`: adds a condition that will end a generator 243 | - `cycle`: makes a finite generator infinite by constantly repeating it 244 | 245 | # Ch6. Matrix and Vector Computation 246 | > Understanding the motivation behind your code and the intricacies of the algorithm will give you deeper insight about possible methods of optimization 247 | 248 | ## Memory fragmentation 249 | Python doesn't natively support vectorization 250 | - Python lists store pointers to the actual data -> good because it allows us to store whatever type of data inside a list, however when it comes to vector and matrix operations, this is a source of performance degradation 251 | - Python bytecode is not optimized for vectorization -> `for` loops cannot predict when using vectorization would be benefical 252 | 253 | *von Neumann bottleneck*: limited bandwidth between memory and CPU as a result of the tiered memory architecture that modern computers use 254 | 255 | `perf` Linux tool: insights into how the CPU is dealing with the program being run 256 | 257 | `array` object is less suitable for math and more suitable for storing fixed-type data more efficiently in memory 258 | 259 | ## numpy 260 | `numpy` has all of the features we need—it stores data in contiguous chunks of memory and supports vectorized operations on its data. As a result, any arithmetic we do on `numpy` arrays happens in chunks without us having to explicitly loop over each element. Not only is it much easier to do matrix arithmetic this way, but it is also faster 261 | 262 | Vectorization from `numpy`: may run fewer instructions per cycle, but each instruction does much more work 263 | 264 | ## numexpr: making in-place operations faster and easier 265 | - `numpy`'s optimization of vector operations: occurs on only one operation at a time 266 | - `numexpr` is a module that can take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space used. 
Expressions can utilize multiple CPU cores 267 | - Easy to change code to use `numexpr`: rewrite the expressions as strings with references to local variables 268 | 269 | ## Lessons from matrix optimizations 270 | Always take care of any administrative things the code must do during initialization 271 | - allocating memory 272 | - reading a configuration from a file 273 | - precomputing values that will be needed throughout the lifetime of a program 274 | 275 | ## Pandas 276 | ### Pandas's internal model 277 | - Operations on columns often generate temporary intermediate arrays which consume RAM: expect a temporary memory usage of up to 3-5x your current usage 278 | - Operations can be single-threaded and limited by Python's global interpreter lock (GIL) 279 | - Columns of the same `dtype` are grouped together by a `BlockManager` -> make row-wise operations on columns of the same datatype faster 280 | - Operations on data of a single common block -> *view*; different `dtypes` -> can cause a *copy* (slower) 281 | - Pandas uses a mix of NumPy datatypes and its own extension datatypes 282 | - numpy `int64` isn't NaN aware -> Pandas `Int64` uses two columns of data: integers and NaN bit mask 283 | - numpy `bool` isn't NaN aware -> Pandas `boolean` 284 | 285 | > More safety makes things run slower (checking passing appropriate data) -> **Developer time (and sanity) x Execution time**. Checks enabled: avoid painful debugging sessions, which kill developer productivity. If we know that our data is of the correct form for our chosen algorithm, these checks will add a penalty 286 | 287 | ## Building DataFrames and Series from partial results rather than concatenating 288 | - Avoid repeated calls to `concat` in Pandas (and to the equivalent `concatenate` in NumPy) 289 | - Build lists of intermediate results and then construct a Series or DataFrame from this list, rather than concatenating to an existing object 290 | 291 | ## Advice for effective pandas development 292 | - Install the optional dependencies `numexpr` and `bottleneck` for additional performance improvements 293 | - Caution against chaining too many rows of pandas operations in sequence: difficult to debug, chain only a couple of operations together to simplify your maintenance 294 | - **Filter your data before calculating** on the remaining rows rather than filtering after calculating 295 | - Check the schema of your DataFrames as they evolve -> tool like `bulwark`, you can visualize confirm that your expectations are being met 296 | - Large Series with low cardinality: `df['series_of_strings'].astype('category')` -> `value_counts` and `groupby` run faster and the Series consume less RAM 297 | - Convert 8-byte `float64` and `int64` to smaller datatypes -> 2-byte `float16` or 1-byte `int8` -> smaller range to further save RAM 298 | - Use the `del` keyword to delete earlier references and clear them from memory 299 | - Pandas `drop` method to delete unused columns 300 | - Persist the prepared DataFrame version to disk by using `to_pickle` 301 | - Avoid `inplace=True` -> are scheduled to be removed from the library over time 302 | - `Modin`, `cuDF` 303 | - `Vaex`: work on very large datasets that exceed RAM by using lazy evaluation while retaining a similar interface to Pandas -> large datasets and string-heavy operations 304 | 305 | # Ch7. 
Compiling to C 306 | To make code run faster: 307 | - Make it do less work 308 | - Choose good algorithms 309 | - Reduce the amount of data you're processing 310 | - Execute fewer instructions -> compile your code down to machine code 311 | 312 | ## Python offers 313 | - `Cython`: pure C-based compiling 314 | - `Numba`: LLVM-based compiling 315 | - `PyPy`: replacement virtual machine which includes a built-in just-in-time (JIT) compiler 316 | 317 | ## What sort of speed gains are possible? 318 | Compiling generate more gains when the code: 319 | - is mathematical 320 | - has lots of loops that repeat the same operations many times 321 | 322 | Unlikely to show speed up: 323 | - calls to external libraries (regexp, string operations, calls to database) 324 | - programs that are I/O-bound 325 | 326 | ## JIT versus AOT compilers 327 | - **AOT (ahead of time)**: `Cython` -> you'll have a library that can instantly be used -> best speedups, but requires the most manual effort 328 | - **JIT (just in time)**: `Numba`, `PyPy` -> you don't have to do much work up front, but you have a "cold start" problem -> impressive speedups with little manual intervention 329 | 330 | ## Why does type information help the code run faster? 331 | Python is dynamically typed -> keeping the code generic makes it run more slowly 332 | 333 | > “Inside a section of code that is CPU-bound, it is often the case that the types of variables do not change. This gives us an opportunity for **static compilation and faster code execution**” 334 | 335 | ## Using a C compiler 336 | `Cython` uses `gcc`: good choice for most platforms; well supported and quite advanced 337 | 338 | ## Cython 339 | - Compiler that converts type-annotaded (C-like) Python into a compiled extension module 340 | - Wide used and mature 341 | - `OpenMP` support: possible to convert parallel problems into multiprocessing-aware modules 342 | - `pyximport`: simplified build system 343 | - Annotation option that output an HTML file -> more yellow = more calls into the Python virtual machine; more white = more non-Python C code 344 | 345 | Lines that cost the most CPU time: 346 | - inside tight inner loops 347 | - dereferencing `list`, `array` or `np.array` items 348 | - mathematical operations 349 | 350 | `cdef` keyword: declare variables inside the function body. These must be declared at the top of the function, as that’s a requirement from the C language specification 351 | 352 | > **Strength reduction**: writing equivalent but more specialized code to solve the same problem. Trade worse flexibility (and possibly worse readability) for faster execution 353 | 354 | `memoryview`: allows the same low-level access to any object that implements the buffer interface, including `numpy` arrays and Python arrays 355 | 356 | ## Numba 357 | - JIT compiler that specializes in `numpy` code, which it compiles via LLVM compiler at runtime 358 | - You provide a decorator telling it which functions to focus on and then you let Numba take over 359 | - `numpy` arrays and nonvectorized code that iterates over many items: Numba should give you a quick and very painless win. 360 | - Numba does not bind to external C libraries (which Cython can do), but it can automatically generate code for GPUs (which Cython cannot). 
361 | - OpenMP parallelization support with `prange` 362 | - Break your code into small (<10 line) and discrete functions and tackle these one at a time 363 | 364 | ``` 365 | from numba import jit 366 | 367 | @jit() 368 | def my_fn(): 369 | ``` 370 | 371 | ## PyPy 372 | - Alternative implementation of the Python language that includes a tracing just-in-time compiler 373 | - Offers a faster experience than CPython 374 | - Uses a different type of garbage collector (modified mark-and-sweep) than CPython (reference counting) = may clean up an unused object much later 375 | - PyPy can use a lot of RAM 376 | - `vmprof`: lightweight sampling profiler 377 | 378 | ## When to use each technology 379 | ![compiler_options](./compiler_options.png) 380 | 381 | - `Numba`: quick wins for little effort; young project 382 | - `Cython`: best results for the widest set of prolbmes; requires more effort; mix Python and C annotations 383 | - `PyPy`: strong option if you're not using `numpy` or other hard-to-port C extensions 384 | 385 | ### Other upcoming projects 386 | - Pythran 387 | - Transonic 388 | - ShedSkin 389 | - PyCUDA 390 | - PyOpenCL 391 | - Nuitka 392 | 393 | ## Graphics Processing Units (GPUs) 394 | Easy-to-use GPU mathematics libraries: 395 | - TensorFlow 396 | - PyTorch 397 | 398 | ### Dynamic graphs: PyTorch 399 | Static computational graph tensor library that is particularly user-friendly and has a very intuitive API for anyone familiar with `numpy` 400 | 401 | > *Static computational graph*: performing operations on `PyTorch` objects creates a dynamic definition of a program that gets compiled to GPU code in the background when it is executed -> changes to the Python code automatically get reflected in changes in the GPU code without an explicit compilation step needed 402 | 403 | ### Basic GPU profiling 404 | - `nvidia-smi`: inspect the resource utilization of the GPU 405 | - Power usage is a good proxy for judging how much of the GPU's compute power is being used -> more power the GPU is drawing = more compute it is currently doing 406 | 407 | ### When to use GPUs 408 | - Task requires mainly linear algebra and matrix manipulations (multiplication, addition, Fourier transforms) 409 | - Particularly true if the calculation can happen on the GPU uninterrupted for a period of time before being copied back into system memory 410 | - GPU can run many more tasks at once than the CPU can, but each of those tasks run more slowly on the GPU than on the CPU 411 | - Not a good tool for tasks that require exceedingly large amounts of data, many conditional manipulations of the data, or changing data 412 | 413 | 1. Ensure that the memory use of the problem will fit withing the GPU 414 | 2. Evaluate whether the algorithm requires a lot of branching conditions versus vectorized operations 415 | 3. Evaluate how much data needs to be moved between the GPU and the CPU 416 | 417 | # Ch8. 
Asynchronous I/O 418 | *I/O bound program*: the speed is bounded by the efficiency of the input/output 419 | 420 | Asynchronous I/O helps utilize the wasted *I/O wait* time by allowing us to perform other operations while we are in that state 421 | 422 | ## Introduction to asynchronous programming 423 | - *Context switch*: when a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request 424 | - **Callback paradigm**: functions are called with an argument that is generally called the callback -> instead of the function returing its value, it call the callback function with the value instead -> long chains = "callback hell" 425 | - **Future paradigm**: an asynchronous function returns a `Future` object, which is a promise of a future result 426 | - `asyncio` standard library module and PEP 492 made the future's mechanism native to Python 427 | 428 | ## How does async/await work? 429 | - `async` function (defined with `async def`) is called a *coroutine* 430 | - Coroutines are implemented with the same philosophies as generators 431 | - `await` is similar in function to a `yield` -> the execution of the current function gets paused while other code is run 432 | 433 | ### Gevent 434 | - Patches the standard library with asynchronous I/O functions, 435 | - Has a `Greenlets` object that can be used for concurrent execution 436 | - Ideal solution for mainly CPU-based problems that sometimes involve heavy I/O 437 | 438 | ### tornado 439 | - Frequently used package for asynchronous I/O in Python 440 | - Originally developed by Facebook primarily for HTTP clients and servers 441 | - Ideal for any application that is mostly I/O-bound and where most of the application should be asynchronous 442 | - Performant web server 443 | 444 | ### aiohttp 445 | - Built entirely on the `asyncio` library 446 | - Provides both HTTP client and server functionality 447 | - Uses a similar API to `tornado` 448 | 449 | ### Batched results 450 | - *Pipelining*: batching results -> can help lower the burden of an I/O task 451 | - Good compromise between the speeds of asynchronous I/O and the ease of writing serial programs 452 | 453 | # Ch9. The multiprocessing module 454 | - Additional process = more communication overhead = decrease available RAM -> rarely get a full *n*-times speedup 455 | - If you run out of RAM and the system reverts to using the disk’s swap space, any parallelization advantage will be massively lost to the slow paging of RAM back and forth to disk 456 | - Using hyperthreads: CPython uses a lot of RAM -> hyperthreading is not cache friendly. 
Hyperthreads = added bonus and not a resource to be optimized against -> adding more CPUs is more economical than tuning your code 457 | - **Amdahl's law**: if only a small part of your code can be parallelized, it doesn't matter how many CPUs you throw at it; it still won't run much faster overall 458 | - `multiprocessing` module: process and thread-based parallel processing, share work over queues, and share data among processes -> focus: single-machine multicore parallelism 459 | - `multiprocessing`: higher level, sharing Python data structures 460 | - `OpenMP`: works with C primitive objects once you've compiled to C 461 | 462 | > Keep the parallelism as simple as possible so that your development velocity is kept high 463 | 464 | - *Embarrassingly parallel*: multiple Python processes all solving the same problem without communicating with one another -> not much penalty will be incurred as we add more and more Python processes 465 | 466 | Typical jobs for the `multiprocessing` module: 467 | - Parallelize a CPU-bound task with `Process` or `Pool` objects 468 | - Parallelize an I/O-bound task in a `Pool` with threads using the `dummy` module 469 | - Share pickled work via a `Queue` 470 | - Share state between parallelized workers, including bytes, primitive datatypes, dictionaries, and lists 471 | 472 | > `Joblib`: stronger cross-platform support than `multiprocessing` 473 | 474 | ## Replacing multiprocessing with Joblib 475 | - `Joblib` is an improvement on `multiprocessing` 476 | - Enables lightweight pipelining with a focus on: 477 | - easy parallel computing 478 | - transparent disk-based caching of results 479 | - It focuses on NumPy arrays for scientific computing 480 | - Quick wins: 481 | - process a loop that could be embarrassingly parallel 482 | - expensive functions that have no side effect 483 | - able to share `numpy` data between processes 484 | - `Parallel` class: sets up the process pool 485 | - `delayed` decorator: wraps our target function so it can be applied to the instantiated `Parallel` object via an iterator 486 | 487 | ### Intelligent caching of function call results 488 | `Memory` cache: decorator that caches functions results based on the input arguments to a disk cache 489 | 490 | ### Using numpy 491 | - `numpy` is more cache friendly 492 | - `numpy` can achieve some level of additional speedup around threads by working outside the GIL 493 | 494 | ## Asynchronous systems 495 | Require a special level of patience. Suggestions: 496 | - K.I.S.S. 497 | - Avoiding asynchronous self-contained systems if possible, as they will grow in complexity and quickly become hard to maintain 498 | - Using mature libraries like `gevent` that give you tried-and-tested approaches to dealing with certain problem sets 499 | 500 | ## Interprocess Communication (IPC) 501 | - Cooperation cost can be high: synchronizing data and checking the shared data 502 | - Sharing state tends to make things complicated 503 | - IPC is fairly easy but generally comes with a cost 504 | 505 | ### multiprocessing.Manager() 506 | - Lets us share higher-level Python objects between processes as managed shared objects; the lower-level objects are wrapped in proxy objects 507 | - The wrapping and safety have a speed cost but also offer great flexibility. 508 | - You can share both lower-level objects (e.g., integers and floats) and lists and dictionaries. 509 | 510 | ### Redis 511 | - **Key/value in-memory storage engine**. 
It provides its own locking and each operation is atomic, so we don’t have to worry about using locks from inside Python (or from any other interfacing language). 512 | - Lets you share state not just with other Python processes but also other tools and other machines, and even to expose that state over a web-browser interface 513 | - Redis lets you store: Lists of strings; Sets of strings; Sorted sets of strings; Hashes of strings 514 | - Stores everything in RAM and snapshots to disk 515 | - Supports master/slave replication to a cluster of instances 516 | - Widely used in industry and is mature and well trusted 517 | 518 | ## mmap 519 | - Memory-mapped (shared memory) solution 520 | - The bytes in a shared memory block are not synchronized and they come with very little overhead 521 | - Bytes act like a file -> block of memory with a file-like interface 522 | 523 | # Ch10. Clusters and Job Queues 524 | *Cluster*: collection of computers working together to solve a common task 525 | 526 | Before moving to a clustered solution: 527 | - Profile your system to understand the bottlenecks 528 | - Exploit compile solutions (Numba, Cython) 529 | - Exploit multiple cores on a single machine (Joblib, multiprocessing) 530 | - Exploit techniques for using less RAM 531 | - Really need a lot of CPUs, high resiliency, rapid speed of response, ability to process data from disks in parallel 532 | 533 | ## Benefits of clustering 534 | - Easily scale computing requirements 535 | - Improve reliability 536 | - Dynamic scaling 537 | 538 | ## Drawbacks of clustering 539 | - Change in thinking 540 | - Latency between machines 541 | - Sysadmin problems: software versions between machines, are other machines working? 542 | - Moving parts that need to be in sync 543 | - "If you don't have a documented restart plan, you should assume you'll have to write one at the worst possible time" 544 | 545 | > Using a cloud-based cluster can mitigate a lot of these problems, and some cloud providers also offer a spot-priced market for cheap but temporary computing resources. 546 | 547 | - A system that's easy to debug *probably* beats having a faster system 548 | - Engineering time and the cost of downtime are *probably* your largest expenses 549 | 550 | ## Parallel Pandas with Dask 551 | - Provide a suite of parallelization solutions that scales from a single core on a laptop to multicore machines to thousands of cores in a cluster. 
552 | - "Apache Spark lite” 553 | - For `Pandas` users: larger-than-RAM datasets and desire for multicore parallelization 554 | 555 | ### Dask 556 | - *Bag*: enables parallelized computation on unstructured and semistructured data 557 | - *Array*: enables distributed and larger-than-RAM `numpy` operations 558 | - *Distributed DataFrame*: enables distributed and larger-than-RAM `Pandas` operations 559 | - *Delayed*: parallelize chains of arbitrary Python functions in a lazy fashion 560 | - *Futures*: interface that includes `Queue` and `Lock` to support task collaboration 561 | - *Dask-ML*: scikit-learn-like interface for scalable machine learning 562 | 563 | > You can use Dask (and Swifter) to parallelize any side-effect-free function that you’d usually use in an `apply` call 564 | 565 | - `npartitions` = # cores 566 | 567 | #### Swifter 568 | Builds on Dask to provide three parallelized options with very simple calls: `apply`, `resample` and `rolling` 569 | 570 | ### Vaex 571 | - String-heavy DataFrames 572 | - Larger-than-RAM datasets 573 | - Subsets of a DataFrame -> Implicit lazy evaluation 574 | 575 | ## NSQ for robust production clustering 576 | - Highly performant distributed messaging platform 577 | - *Queues*: type of buffer for messages 578 | - *Pub/subs*: describes who gets what messages (*publisher/subscriber*) 579 | 580 | # Ch11. Using less RAM 581 | - Counting the amount of RAM used by Python object is tricky -> if we ask the OS for a count of bytes used, it will tell us the total amount allocated to the process 582 | - Each unique object has a memory cost 583 | 584 | ## Objects for primitives are expensive 585 | `memory_profiler` 586 | 587 | ``` 588 | %load_ext memory_profiler 589 | 590 | %memit 591 | ``` 592 | 593 | ### The `array` module stores many primitive objects cheaply 594 | - Creates a contiguos block of RAM to hold the underlying data. Which data structures: 595 | - integers, floats and characters 596 | - *not* complex numbers or classes 597 | - Good to pass the array to an external process or use only some of the data (not to compute on them) 598 | - Using a regular `list` to store many numbers is much less efficient in RAM than using an `array` object 599 | - `numpy` arrays are almost certainly a better choice if you are doing anything heavily numeric: 600 | - more datatype options 601 | - many specialized and fast functions 602 | 603 | ### Using less RAM in NumPy with NumExpr 604 | `NumExpr` is a tool that both speeds up and reduces the size of intermediate operations 605 | 606 | > Install the optional NumExpr when using Pandas (Pandas does not tell you if you haven’t installed NumExpr) -> calls to `eval` will run more quickly -> import numpexpr: if this fails, install it! 
609 | 
610 | ## Bytes versus Unicode
611 | - In Python 3.x, all strings are Unicode by default, and if you want to deal in bytes, you'll explicitly create a `bytes` sequence
612 | - **UTF-8 encoding** of a Unicode object uses 1 byte per ASCII character and more bytes for less frequently seen characters
613 | 
614 | ## More efficient tree structures to represent strings
615 | - **Tries**: share common prefixes
616 | - **DAWG**: share common prefixes and suffixes
617 | - Overlapping sequences in your strings -> you'll likely see a RAM improvement
618 | - Save RAM and time in exchange for a little additional effort in preparation
619 | - Unfamiliar data structures to many developers -> isolate in a module to simplify maintenance
620 | 
621 | ### Directed Acyclic Word Graph (DAWG)
622 | Attempts to efficiently represent strings that share common prefixes and suffixes
623 | 
624 | ### Marisa Trie
625 | *Static trie* using Cython bindings to an external library -> it cannot be modified after construction
626 | 
627 | ## Scikit-learn's DictVectorizer and FeatureHasher
628 | - `DictVectorizer`: takes a dictionary of terms and their frequencies and converts them into a variable-width sparse matrix -> it is possible to revert the process
629 | - `FeatureHasher`: converts the same dictionary of terms and frequencies into a fixed-width sparse matrix -> it doesn't store a vocabulary and instead employs a hashing algorithm to assign token frequencies to columns -> can't convert it back to the original token from hash
630 | 
631 | ## SciPy's Sparse Matrices
632 | - Matrix in which most matrix elements are 0
633 | - `COO` matrices: simplest implementation: for each non-zero element we store the value in addition to the location of the value -> each non-zero value = 3 numbers stored -> used only to construct sparse matrices and not for actual computation
634 | - `CSR/CSC` is preferred for computation
635 | 
636 | > Push and pull of speedups with sparse arrays: balance between losing the use of efficient caching and vectorization versus not having to do a lot of the calculations associated with the zero values of the matrix
637 | 
638 | Limitations:
639 | - Low amount of support
640 | - Multiple implementations with benefits and drawbacks
641 | - May require expert knowledge
642 | 
643 | ## Tips for using less RAM
644 | > "If you can avoid putting it into RAM, do.
Everything you load costs you RAM" 645 | 646 | - Numeric data: switch to using `numpy` arrays 647 | - Very sparse arrays: SciPy's sparse array functionality 648 | - Strings: stick to `str` rather than `bytes` 649 | - Many Unicode objects in a static structure: DAWG and trie structures 650 | - Lots of bit strings: `numpy` and the `bitarray` package 651 | 652 | ## Probabilistic Data Structures 653 | - Make trade-offs in accuracy for immense decrease in memory usage 654 | - The number of operations you can do on them is much more restricted 655 | 656 | > “Probabilistic data structures are fantastic when you have taken the time to understand the problem and need to put something into production that can answer a very small set of questions about a very large set of data” 657 | 658 | - "lossy compression": find an alternative representation for the data that is more compact and contains the relevant information for answering a certain set of questions 659 | 660 | ### Morris counter 661 | Keeps track of an exponent and models the counted state as `2^exponent` -> provides an *order of magnitude* estimate 662 | 663 | ### K-Minimum values 664 | If we keep the `k` smallest unique hash values we have seen, we can **approximate the overall spacing between hash values** and infer the total number of items 665 | - *idempotence*: if we do the same operation, with the same inputs, on the structure multiple times, the state will not be changed 666 | 667 | ### Bloom filters 668 | - Answer the question of **whether we've seen an item before** 669 | - Work by having multiple hash values in order to represent a value as multiple integers. If we later see something with the same set of integers, we can be reasonably confident that it is the same value 670 | - **No false negatives and a controllable rate of false positives** 671 | - Set to have error rates below 0.5% 672 | 673 | # Ch12. Lessons from the field 674 | -------------------------------------------------------------------------------- /learning-python-design-patterns/cover.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/learning-python-design-patterns/cover.jpg -------------------------------------------------------------------------------- /learning-python-design-patterns/notes.md: -------------------------------------------------------------------------------- 1 | # Learning Python Design Patterns 2 | 3 | Author: Chetan Giridhar 4 | 5 | [Available here](https://www.amazon.com/Learning-Python-Design-Patterns-Second-ebook/dp/B018XYKNOM) 6 | 7 | ![learning-python-design-patterns](cover.jpg) 8 | 9 | - [Learning Python Design Patterns](#learning-python-design-patterns) 10 | - [Ch1. 
Introduction to design patterns](#ch1-introduction-to-design-patterns) 11 | - [Understanding object-oriented programming](#understanding-object-oriented-programming) 12 | - [Classes](#classes) 13 | - [Methods](#methods) 14 | - [Major aspects of OOP](#major-aspects-of-oop) 15 | - [Encapsulation](#encapsulation) 16 | - [Polymorphism](#polymorphism) 17 | - [Inheritance](#inheritance) 18 | - [Abstraction](#abstraction) 19 | - [Composition](#composition) 20 | - [Object-oriented design principles](#object-oriented-design-principles) 21 | - [The open/close principle](#the-openclose-principle) 22 | - [The inversion of control principle](#the-inversion-of-control-principle) 23 | - [The interface segregation principle](#the-interface-segregation-principle) 24 | - [The single responsibility principle](#the-single-responsibility-principle) 25 | - [The substitution principle](#the-substitution-principle) 26 | - [The concept of design patterns](#the-concept-of-design-patterns) 27 | - [Advantages of design patterns](#advantages-of-design-patterns) 28 | - [Patterns for dynamic languages](#patterns-for-dynamic-languages) 29 | - [Classifying patterns](#classifying-patterns) 30 | - [Creational patterns](#creational-patterns) 31 | - [Structural patterns](#structural-patterns) 32 | - [Behavioral patterns](#behavioral-patterns) 33 | - [Ch2. The singleton design pattern](#ch2-the-singleton-design-pattern) 34 | - [Monostate Singleton pattern](#monostate-singleton-pattern) 35 | - [Singletons and metaclasses](#singletons-and-metaclasses) 36 | - [Drawbacks](#drawbacks) 37 | - [Ch3. The factory pattern - building factories to create objects](#ch3-the-factory-pattern---building-factories-to-create-objects) 38 | - [Understanding the factory pattern](#understanding-the-factory-pattern) 39 | - [The simple factory pattern](#the-simple-factory-pattern) 40 | - [The factory method pattern](#the-factory-method-pattern) 41 | - [The abstract factory pattern](#the-abstract-factory-pattern) 42 | - [Factory method versus abstract factory method](#factory-method-versus-abstract-factory-method) 43 | - [Ch4. The façade pattern - being adaptive with façade](#ch4-the-façade-pattern---being-adaptive-with-façade) 44 | - [Understanding Structural design patterns](#understanding-structural-design-patterns) 45 | - [Understanding the Façade design pattern](#understanding-the-façade-design-pattern) 46 | - [Main participants](#main-participants) 47 | - [The principle of least knowledge](#the-principle-of-least-knowledge) 48 | - [The Law of Demeter](#the-law-of-demeter) 49 | - [Ch5. The proxy pattern - controlling object access](#ch5-the-proxy-pattern---controlling-object-access) 50 | - [Data Structure components](#data-structure-components) 51 | - [Different types of proxies](#different-types-of-proxies) 52 | - [Decorator vs Proxy](#decorator-vs-proxy) 53 | - [Disadvantages](#disadvantages) 54 | - [Ch6. The observer pattern - keeping objects in the know](#ch6-the-observer-pattern---keeping-objects-in-the-know) 55 | - [Behavioral patterns](#behavioral-patterns-1) 56 | - [Understanding the observer design pattern](#understanding-the-observer-design-pattern) 57 | - [The pull model](#the-pull-model) 58 | - [The push model](#the-push-model) 59 | - [Loose coupling and the observer pattern](#loose-coupling-and-the-observer-pattern) 60 | - [Ch7. 
The command pattern - encapsulating invocation](#ch7-the-command-pattern---encapsulating-invocation) 61 | - [Understanding the command design pattern](#understanding-the-command-design-pattern) 62 | - [Intentions](#intentions) 63 | - [Scenarios of use](#scenarios-of-use) 64 | - [Advantages](#advantages) 65 | - [Disadvantages](#disadvantages-1) 66 | - [Ch8. The templated method pattern - encapsulating algorithm](#ch8-the-templated-method-pattern---encapsulating-algorithm) 67 | - [Use cases](#use-cases) 68 | - [Intentions](#intentions-1) 69 | - [Terms](#terms) 70 | - [Hooks](#hooks) 71 | - [The Hollywood principle](#the-hollywood-principle) 72 | - [Advantages](#advantages-1) 73 | - [Disadvantages](#disadvantages-2) 74 | - [Ch9. Model-View-Controller - Compound patterns](#ch9-model-view-controller---compound-patterns) 75 | - [The Model-View-Controller pattern](#the-model-view-controller-pattern) 76 | - [Terms](#terms-1) 77 | - [Intention](#intention) 78 | - [The MVC pattern in the real world](#the-mvc-pattern-in-the-real-world) 79 | - [Benefits of the MVC pattern](#benefits-of-the-mvc-pattern) 80 | - [Ch10. The state design pattern](#ch10-the-state-design-pattern) 81 | - [Understanding the state design pattern](#understanding-the-state-design-pattern) 82 | - [Advantages](#advantages-2) 83 | - [Disadvantages](#disadvantages-3) 84 | - [Ch11. AntiPatterns](#ch11-antipatterns) 85 | - [Software development AntiPatterns](#software-development-antipatterns) 86 | - [Spaghetti code](#spaghetti-code) 87 | - [Golden Hammer](#golden-hammer) 88 | - [Lava Flow](#lava-flow) 89 | - [Copy-and-paste or cut-and-paste programming](#copy-and-paste-or-cut-and-paste-programming) 90 | - [Software architecture AntiPatterns](#software-architecture-antipatterns) 91 | - [Reinventing the wheel](#reinventing-the-wheel) 92 | - [Vendor lock-in](#vendor-lock-in) 93 | - [Design by committee](#design-by-committee) 94 | 95 | # Ch1. Introduction to design patterns 96 | 97 | ## Understanding object-oriented programming 98 | - Concept of *objects* that have attributes (data members) and procedures (member functions) 99 | - Procedures are responsible for manipulating the attributes 100 | - Objects, which are instances of classes, interact among each other to serve the purpose of an application under development 101 | 102 | ### Classes 103 | - Define objects in attributes and behaviors (methods) 104 | - Classes consist of constructors that provide the initial state for these objects 105 | - Are like templates and hence can be easily reused 106 | 107 | ### Methods 108 | - Represent the behavior of the object 109 | - Work on attributes and also implement the desired functionality 110 | 111 | ## Major aspects of OOP 112 | 113 | ### Encapsulation 114 | - An object's behavior is kept hidden from the outside world or objects keep their state information private 115 | - Clients can't change the object's internal state by directly acting on them 116 | - Clients request the object by sending requests. Based on the type, objects may respond by changing their internal state using special member functions such as `get` and `set` 117 | 118 | ### Polymorphism 119 | - Can be of two types: 120 | - An object provides different implementations of the method based on input parameters 121 | - The same interface can be used by objects of different types 122 | - In Python polymorphism is a feature built-in for the language (e.g. 
+ operator)
123 | 
124 | ### Inheritance
125 | - Indicates that one class derives (most of) its functionality from the parent class
126 | - An option to reuse functionality defined in the base class and allow independent extensions of the original software implementation
127 | - Creates hierarchy via the relationships among objects of different classes
128 | - Python supports multiple inheritance (multiple base classes)
129 | 
130 | ### Abstraction
131 | - Provides a simple interface to the clients. Clients can interact with class objects and call methods defined in the interface
132 | - Abstracts the complexity of internal classes with an interface so that the client need not be aware of internal implementations
133 | 
134 | ### Composition
135 | - Combine objects or classes into more complex data structures or software implementations
136 | - An object is used to call member functions in other modules thereby making base functionality available across modules without inheritance
137 | 
138 | ## Object-oriented design principles
139 | 
140 | ### The open/close principle
141 | > **Classes or objects and methods should be open for extension but closed for modifications**
142 | 
143 | - Make sure you write your classes or modules in a generic way
144 | - Existing classes are not changed, reducing the chances of regression
145 | - Helps maintain backward compatibility
146 | 
147 | ### The inversion of control principle
148 | > **High-level modules shouldn't be dependent on low-level modules; they should be dependent on abstractions. Details should depend on abstractions and not the other way round**
149 | 
150 | - The base module and dependent module should be decoupled with an abstraction layer in between
151 | - The details of your class should represent the abstractions
152 | - Tight coupling between modules is no longer prevalent, so there is less complexity/rigidity in the system
153 | - Easy to deal with dependencies across modules in a better way
154 | 
155 | ### The interface segregation principle
156 | > **Clients should not be forced to depend on interfaces they don't use**
157 | 
158 | - Forces developers to write thin interfaces and have methods that are specific to the interface
159 | - Helps you not to populate interfaces by adding unintentional methods
160 | 
161 | ### The single responsibility principle
162 | > **A class should have only one reason to change**
163 | 
164 | - If a class is taking care of two functionalities, it is better to split them
165 | - Functionality = a reason to change
166 | - Whenever there is a change in one functionality, this particular class needs to change, and nothing else
167 | - If a class has multiple functionalities, the dependent classes will have to undergo changes for multiple reasons, which gets avoided
168 | 
169 | ### The substitution principle
170 | > **Derived classes must be able to completely substitute the base classes**
171 | 
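A minimal sketch of the open/close and substitution principles in Python; the shape classes are illustrative, not from the book. New behavior arrives via a new subclass (no existing class is modified), and any subclass can stand in for the base class:

```
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self) -> float:
        ...


class Rectangle(Shape):
    def __init__(self, width: float, height: float):
        self.width, self.height = width, height

    def area(self) -> float:
        return self.width * self.height


class Circle(Shape):  # extension: added without touching existing classes
    def __init__(self, radius: float):
        self.radius = radius

    def area(self) -> float:
        return 3.14159 * self.radius ** 2


def total_area(shapes):
    # works with any Shape subclass (substitution principle)
    return sum(shape.area() for shape in shapes)
```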
172 | ## The concept of design patterns
173 | - Solutions to given problems
174 | - Design patterns are discoveries and not an invention in themselves
175 | - Is about learning from others' successes rather than your own failures!
176 | 
177 | ### Advantages of design patterns
178 | - Reusable across multiple projects
179 | - Architectural-level problems can be solved
180 | - Time-tested and well-proven, which is the experience of developers and architects
181 | - They are reliable and dependable
182 | 
183 | ### Patterns for dynamic languages
184 | Python:
185 | - Types or classes are objects at runtime
186 | - Variables can have type as a value and can be modified at runtime
187 | - Dynamic languages have more flexibility in terms of class restrictions
188 | - Everything is public by default
189 | - Design patterns can be easily implemented in dynamic languages
190 | 
191 | ## Classifying patterns
192 | - Creational
193 | - Structural
194 | - Behavioral
195 | 
196 | > Classification of patterns is done based primarily on how the objects get created, how classes and objects are structured in a software application, and also covers the way objects interact among themselves
197 | 
198 | ### Creational patterns
199 | - Work on the basis of how objects can be created
200 | - Isolate the details of object creation
201 | - Code is independent of the type of object to be created
202 | 
203 | ### Structural patterns
204 | - Design the structure of objects and classes so that they can compose to achieve larger results
205 | - Focus on simplifying the structure and identifying the relationship between classes and objects
206 | - Focus on class inheritance and composition
207 | 
208 | ### Behavioral patterns
209 | - Concerned with the interaction among objects and responsibilities of objects
210 | - Objects should be able to interact and still be loosely coupled
211 | 
212 | # Ch2. The singleton design pattern
213 | - Typically used in logging or database operations, printer spoolers, thread pools, caches, dialog boxes, registry settings, and so on
214 | - Ensure that only one object of the class gets created
215 | - Provide an access point for an object that is global to the program
216 | - Control concurrent access to resources that are shared
217 | - Make the constructor private and create a static method that does the object initialization
218 | - Override the `__new__` method (Python's special method to instantiate objects) to control the object creation (see the sketch at the end of this chapter)
219 | - Another use case: **lazy instantiation**. Makes sure that the object gets created when it's actually needed
220 | - All modules are Singletons by default because of Python's importing behavior
221 | 
222 | ## Monostate Singleton pattern
223 | - All objects share the same state
224 | - Assign the `__dict__` variable with the `__shared_state` class variable. Python uses `__dict__` to store the state of every object of a class
225 | 
226 | ## Singletons and metaclasses
227 | - A metaclass is a class of a class
228 | - The class is an instance of its metaclass
229 | - Programmers get an opportunity to create classes of their own type from the predefined Python classes
230 | 
231 | ## Drawbacks
232 | - Singletons have a global point of access
233 | - All classes that are dependent on global variables get tightly coupled, as a change to the global data by one class can inadvertently impact the other class
234 | 
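A minimal sketch of the `__new__`-based Singleton described above; the class name is illustrative, not from the book:

```
class Singleton:
    _instance = None  # cached single instance

    def __new__(cls, *args, **kwargs):
        # Create the object only once; later calls reuse the cached instance
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


s1 = Singleton()
s2 = Singleton()
assert s1 is s2  # both names point to the same object
```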
235 | # Ch3. The factory pattern - building factories to create objects
236 | 
237 | ## Understanding the factory pattern
238 | - Factory = a class that is responsible for creating objects of other types
239 | - The class that acts as a factory has an object and methods associated with it
240 | - The client calls this method with certain parameters; objects of desired types are created in turn and returned to the client by the factory
241 | 
242 | **Advantages**
243 | - Loose coupling: object creation can be independent of the class implementation
244 | - The client only needs to know the interface, methods, and parameters that need to be passed to create objects of the desired type (simplifies implementations for the client)
245 | - Adding another class to the factory to create objects of another type can be easily done without the client changing the code
246 | 
247 | ## The simple factory pattern
248 | - Not a pattern in itself
249 | - Helps create objects of different types rather than direct object instantiation
250 | 
251 | ## The factory method pattern
252 | - We define an interface to create objects, but instead of the factory being responsible for the object creation, the responsibility is deferred to the subclass that decides the class to be instantiated (see the sketch at the end of this chapter)
253 | - Creation is through inheritance and not through instantiation
254 | - Makes the design more customizable. It can return the same instance or subclass rather than an object of a certain type
255 | 
256 | > The factory method pattern defines an interface to create an object, but defers the decision on which class to instantiate to its subclasses
257 | 
258 | **Advantages**
259 | - Makes the code generic and flexible, not being tied to a certain class for instantiation. We're dependent on the interface (Product) and not on the ConcreteProduct class
260 | - Loose coupling: the code that creates the object is separate from the code that uses it
261 | - The client doesn't need to bother about what argument to pass and which class to instantiate -> the addition of new classes is easy and involves low maintenance
262 | 
263 | ## The abstract factory pattern
264 | 
265 | > Provide an interface to create families of related objects without specifying the concrete class
266 | 
267 | - Makes sure that the client is isolated from the creation of objects but allowed to use the objects created
268 | 
269 | ## Factory method versus abstract factory method
270 | 
271 | | **Factory method** | **Abstract Factory method** |
272 | | :---: | :---: |
273 | | Exposes a method to the client to create the objects | Contains one or more factory methods of another class |
274 | | Uses inheritance and subclass to decide which object to create | Uses composition to delegate responsibility to create objects of another class |
275 | | Is used to create one product | Is about creating families of related products |
276 | 
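A minimal factory method sketch following the generic Product/ConcreteProduct terminology used above; the concrete classes are illustrative, not from the book:

```
from abc import ABC, abstractmethod


class Product(ABC):
    @abstractmethod
    def describe(self):
        ...


class ConcreteProductA(Product):
    def describe(self):
        return "product A"


class ConcreteProductB(Product):
    def describe(self):
        return "product B"


class Creator(ABC):
    @abstractmethod
    def factory_method(self):
        """Subclasses decide which Product class to instantiate."""

    def operation(self):
        # The creator depends only on the Product interface
        return f"working with {self.factory_method().describe()}"


class ConcreteCreatorA(Creator):
    def factory_method(self):
        return ConcreteProductA()


print(ConcreteCreatorA().operation())  # -> working with product A
```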
277 | # Ch4. The façade pattern - being adaptive with façade
278 | 
279 | ## Understanding Structural design patterns
280 | - Describe how objects and classes can be combined to form larger structures. Structural patterns are a combination of class and object patterns
281 | - Ease the design by identifying simpler ways to realize or demonstrate relationships between entities
282 | - Class patterns: describe abstraction with the help of inheritance and provide a more useful program interface
283 | - Object patterns: describe how objects can be associated and composed to form larger objects
284 | 
285 | ## Understanding the Façade design pattern
286 | > Façade hides the complexities of the internal system and provides an interface to the client that can access the system in a very simplified way
287 | 
288 | - Provides a unified interface to a set of interfaces in a subsystem and defines a high-level interface that helps the client use the subsystem in an easy way
289 | - Discusses representing a complex subsystem with a single interface object -> it doesn't encapsulate the subsystem, but actually combines the underlying subsystems
290 | - Promotes the decoupling of the implementation from multiple clients
291 | 
292 | ## Main participants
293 | - **Façade**: wraps up a complex group of subsystems so that it can provide a pleasing look to the outside world
294 | - **System**: represents a set of varied subsystems that make the whole system compound and difficult to view or work with
295 | - **Client**: interacts with the façade so that it can easily communicate with the subsystem and get the work completed (doesn't have to bother about the complex nature of the system)
296 | 
297 | ## The principle of least knowledge
298 | - Design principle behind the Façade pattern
299 | - Reduce the interactions between objects to just a few friends that are close enough to you
300 | 
301 | ## The Law of Demeter
302 | - Design guideline:
303 |   - Each unit should have only limited knowledge of other units of the system
304 |   - A unit should talk to its friends only
305 |   - A unit should not know about the internal details of the object that it manipulates
306 | 
307 | > The principle of least knowledge and Law of Demeter are the same and both point to the philosophy of *loose coupling*
308 | 
309 | # Ch5. The proxy pattern - controlling object access
310 | > Proxy: a system that intermediates between the seeker and provider.
Seeker is the one that makes the request, and provider delivers the resources in response to the request 311 | 312 | - A proxy server encapsulates requests, enables privacy, and works well in distributed architectures 313 | - Proxy is a wrapper or agent object that wraps the real serving object 314 | - Provide a surrogate or placeholder for another object in order to control access to a real object 315 | - Some useful scenarios: 316 | - Represents a complex system in a simpler way 317 | - Acts as a shield against malicious intentions and protect the real object 318 | - Provides a local interface for remote objects on different servers 319 | - Provides a light handle for a higher memory-consuming object 320 | 321 | ## Data Structure components 322 | - **Proxy** 323 | - **Subject/RealSubject** 324 | - **Client** 325 | 326 | ## Different types of proxies 327 | - **Virtual proxy**: placeholder for objects that are very heavy to instantiate 328 | - **Remote proxy**: provides a local representation of a real object that resides on a remote server or different address space 329 | - **Protective proxy**: controls access to the sensitive matter object of `RealSubject` 330 | - **Smart proxy**: interposes additional actions when an object is accessed 331 | 332 | | Proxy | Façade | 333 | | --------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------- | 334 | | Provides you with a surrogate or placeholder for another object to control access to it | Provides you with an interface to large subsystems of classes | 335 | | A Proxy object has the same interface as that of the target object and holds references to target objects | Minimizes the communication and dependencies between subsystems | 336 | | Acts as an intermediary between the client and object that is wrapped | Provides a single simplified interface | 337 | 338 | ### Decorator vs Proxy 339 | - Decorator adds behavior to the object that it decorates at runtime 340 | - Proxy controls access to an object 341 | 342 | ### Disadvantages 343 | - Proxy pattern can increase the response time 344 | 345 | # Ch6. 
The observer pattern - keeping objects in the know
346 | 
347 | ## Behavioral patterns
348 | - Focus on the responsibilities that an object has
349 | - Deal with the interaction among objects to achieve larger functionality
350 | - Objects should be able to interact with each other, **but they should still be loosely coupled**
351 | 
352 | ## Understanding the observer design pattern
353 | > An object (Subject) maintains a list of dependents (Observers) so that the Subject can notify all the Observers about the changes that it undergoes using any of the methods defined by the Observer
354 | 
355 | - Defines a one-to-many dependency between objects so that any change in one object will be notified to the other dependent objects automatically (see the sketch at the end of this chapter)
356 | - Encapsulates the core component of the Subject
357 | 
358 | ## The pull model
359 | - Subject broadcasts to all the registered Observers when there is any change
360 | - Observer is responsible for getting the changes or pulling data from the Subject when there is an amendment
361 | - Pull model is **ineffective**: involves two steps:
362 |   - Subject notifies the Observer
363 |   - Observer pulls the required data from the Subject
364 | 
365 | ## The push model
366 | - Changes are pushed by the Subject to the Observer
367 | - Subject can send detailed information to the Observer (even though it may not be needed) -> can result in sluggish response times when a large amount of data is sent by the Subject but is never actually used by the Observer
368 | 
369 | ## Loose coupling and the observer pattern
370 | - Coupling refers to the degree of knowledge that one object has about the other object that it interacts with
371 | 
372 | > Loosely-coupled designs allow us to build flexible object-oriented systems that can handle changes because they reduce the dependency between multiple objects
373 | 
374 | - Reduces the risk that a change made within one element might create an unanticipated impact on the other elements
375 | - Simplifies testing, maintenance, and troubleshooting problems
376 | - System can be easily broken down into definable elements
377 | 
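A minimal push-model observer sketch based on the notes above; the class names are illustrative, not from the book:

```
class Subject:
    def __init__(self):
        self._observers = []  # the dependents to notify

    def attach(self, observer):
        self._observers.append(observer)

    def notify(self, message):
        # Push model: the Subject pushes the change data to every Observer
        for observer in self._observers:
            observer.update(message)


class PrintObserver:
    def update(self, message):
        print(f"received: {message}")


subject = Subject()
subject.attach(PrintObserver())
subject.notify("state changed")  # -> received: state changed
```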
378 | # Ch7. The command pattern - encapsulating invocation
379 | > Behavioral design pattern in which an object is used to encapsulate all the information needed to perform an action or trigger an event at a later time
380 | 
381 | ## Understanding the command design pattern
382 | - A `Command` object knows about the `Receiver` objects and invokes a method of the `Receiver` object
383 | - Values for parameters of the receiver method are stored in the `Command` object
384 | - The invoker knows how to execute a command
385 | - The client creates a `Command` object and sets its receiver
386 | 
387 | ### Intentions
388 | - Encapsulating a request as an object
389 | - Allowing the parametrization of clients with different requests
390 | - Allowing requests to be saved in a queue
391 | - Providing an object-oriented callback
392 | 
393 | ### Scenarios of use
394 | - Parametrizing objects depending on the action to be performed
395 | - Adding actions to a queue and executing requests at different points
396 | - Creating a structure for high-level operations that are based on smaller operations
397 | - E.g.:
398 |   - Redo or rollback operations
399 |   - Asynchronous task execution
400 | 
401 | ## Advantages
402 | - Decouples the classes that invoke the operation from the object that knows how to execute the operation
403 | - Provides a queue system
404 | - Extensions to add new commands are easy
405 | - A rollback system with the command pattern can be defined
406 | 
407 | ## Disadvantages
408 | - High number of classes and objects working together to achieve a goal
409 | - Every individual command is a `ConcreteCommand` class that increases the volume of classes for implementation and maintenance
410 | 
411 | # Ch8. The templated method pattern - encapsulating algorithm
412 | 
413 | ## Use cases
414 | - When multiple algorithms or classes implement similar or identical logic
415 | - The implementation of algorithms in subclasses helps reduce code duplication
416 | - Multiple algorithms can be defined by letting the subclasses implement the behavior through overriding
417 | 
418 | ## Intentions
419 | - Define a skeleton of an algorithm with primitive operations
420 | - Redefine certain operations of the subclass without changing the algorithm's structure
421 | - Achieve code reuse and avoid duplicate efforts
422 | - Leverage common interfaces or implementations
423 | 
424 | ## Terms
425 | - `AbstractClass`: Declares an interface to define the steps of the algorithm
426 | - `ConcreteClass`: Defines subclass-specific step definitions
427 | - `template_method()`: Defines the algorithm by calling the step methods (see the sketch at the end of this chapter)
428 | 
429 | ## Hooks
430 | - Hook: a method that is declared in the abstract class
431 | - Give a subclass the ability to *hook into* the algorithm whenever needed
432 | - Not imperative for the subclass to use hooks
433 | 
434 | > We use abstract methods when the subclass must provide the implementation, and a hook is used when it is optional for the subclass to implement it
435 | 
436 | ## The Hollywood principle
437 | - Design principle summarized by **Don't call us, we'll call you**
438 | - Relates to the template method -> it's the high-level abstract class that arranges the steps to define the algorithm
439 | 
440 | ## Advantages
441 | - No code duplication
442 | - Uses inheritance and not composition -> only a few methods need to be overridden
443 | - Flexibility lets subclasses decide how to implement steps in an algorithm
444 | 
445 | ## Disadvantages
446 | - Debugging and understanding the sequence of flow can be confusing
447 | - Documentation and strict error handling have to be done by the programmer
448 | - Maintenance can be a problem -> changes to any level can disturb the implementation
449 | 
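A minimal sketch of the terms above (`AbstractClass`, `ConcreteClass`, `template_method()`, and a hook); the report classes are illustrative, not from the book:

```
from abc import ABC, abstractmethod


class ReportGenerator(ABC):
    """AbstractClass: fixes the skeleton of the algorithm."""

    def template_method(self):
        # The high-level class arranges the steps ("Don't call us, we'll call you")
        data = self.fetch_data()
        report = self.format_report(data)
        self.publish(report)

    @abstractmethod
    def fetch_data(self):
        ...

    @abstractmethod
    def format_report(self, data):
        ...

    def publish(self, report):
        # Hook: has a default, subclasses override it only if needed
        print(report)


class CsvReport(ReportGenerator):
    """ConcreteClass: supplies the step definitions."""

    def fetch_data(self):
        return [("a", 1), ("b", 2)]

    def format_report(self, data):
        return "\n".join(f"{key},{value}" for key, value in data)


CsvReport().template_method()
```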
450 | # Ch9. Model-View-Controller - Compound patterns
451 | > "A compound pattern combines two or more patterns into a solution that solves a recurring or general problem" - GoF
452 | 
453 | A compound pattern is not a set of patterns working together; it is a general solution to a problem
454 | 
455 | ## The Model-View-Controller pattern
456 | - Model represents the data and business logic: how information is stored and queried
457 | - View is nothing but the representation: how it is presented
458 | - Controller is the one that directs the model and view to behave in a certain way: it's the glue between the two
459 | - The view and controller are dependent on the model, but not the other way around
460 | 
461 | ### Terms
462 | - **Model - knowledge of the application**: store and manipulate data (create, modify and delete). Has state and methods to change states, but is not aware of how the data would be seen by the client
463 | - **View - the appearance**: build user interfaces and data displays (it should not contain any logic of its own and just display the data it receives)
464 | - **Controller - the glue**: connects the model and view (it has methods that are used to route requests)
465 | - **User**: requests certain results based on certain actions
466 | 
467 | ### Intention
468 | - Keep the data and presentation of the data separate
469 | - Easy maintenance of the class and implementation
470 | - Flexibility to change the way in which data is stored and displayed -> both are independent and hence have the flexibility to change
471 | 
472 | ## The MVC pattern in the real world
473 | - Django or Rails
474 | - **MTV (Model, Template, View)**: the model is the database, templates are the views, and controllers are the views/routes
475 | 
476 | ## Benefits of the MVC pattern
477 | - Easy maintenance
478 | - Loose coupling
479 | - Decreased complexity
480 | - Development efforts can run independently
481 | 
482 | # Ch10. The state design pattern
483 | - Behavioral design pattern
484 | - Sometimes referred to as an **objects for states** pattern
485 | - Used to develop Finite State Machines and helps to accommodate state transition actions
486 | 
487 | ## Understanding the state design pattern
488 | - `State`: an interface that encapsulates the object's behavior. This behavior is associated with the state of the object
489 | - `ConcreteState`: a subclass that implements the `State` interface -> implements the actual behavior associated with the object's particular state
490 | - `Context`: the interface of interest to clients.
Also maintains an instance of the `ConcreteState` subclass that internally defines the implementation of the object's particular state
491 | 
492 | ## Advantages
493 | - Removes the dependency on the if/else or switch/else conditional logic
494 | - Benefits of implementing polymorphic behavior, also easier to add states to support additional behavior
495 | - Improves cohesion: state-specific behaviors are aggregated into the `ConcreteState` classes, which are placed in one location in the code
496 | - Improves the flexibility to extend the behavior of the application and overall improves code maintenance
497 | 
498 | ## Disadvantages
499 | - Class explosion: every state needs to be defined with the help of `ConcreteState` -> might end up writing many more classes with a small functionality
500 | - `Context` class needs to be updated to deal with each behavior
501 | 
502 | # Ch11. AntiPatterns
503 | Four aspects of a bad design:
504 | - **Immobile**: hard to reuse
505 | - **Rigid**: any small change may in turn result in moving too many parts of the software
506 | - **Fragile**: any change results in breaking the existing system fairly easily
507 | - **Viscous**: changes are done in the code or environment itself to avoid difficult architectural-level changes
508 | 
509 | > An AntiPattern is an outcome of a solution to a recurring problem that is ineffective and becomes counterproductive
510 | 
511 | AntiPatterns may be the result of:
512 | - A developer not knowing the software development practices
513 | - A developer not applying design patterns in the correct context
514 | 
515 | ## Software development AntiPatterns
516 | Software deviates from the original code structure due to:
517 | - The thought process of the developer evolves with development
518 | - Use cases change based on customer feedback
519 | - Data structures designed initially may undergo change with functionality or scalability considerations
520 | 
521 | > Refactoring is one of the critical parts of the software development journey, which provides developers an opportunity to revisit the data structures and think about scalability and ever-evolving customer needs
522 | 
523 | ### Spaghetti code
524 | - Minimum reuse of structures is possible
525 | - Maintenance efforts are too high
526 | - Extension and flexibility to change is reduced
527 | 
528 | ### Golden Hammer
529 | - One solution is obsessively applied to all software projects
530 | - The product is described not by its features, but by the technology used in development
531 | - In the company corridors, you hear developers talking, "That could have been better than using this"
532 | - Requirements are not completed and not in sync with user expectations
533 | 
534 | ### Lava Flow
535 | - Low code coverage for developed tests
536 | - Commented code without reasons
537 | - Obsolete interfaces, or developers try to work around existing code
538 | 
539 | ### Copy-and-paste or cut-and-paste programming
540 | - Similar types of issues across the software application
541 | - Higher maintenance costs and increased bug life cycle
542 | - Less modular code base with the same code running into a number of lines
543 | - Inheriting issues that existed in the first place
544 | 
545 | ## Software architecture AntiPatterns
546 | > Software architecture looks at modeling the software that is well understood by the development and test teams, product managers, and other stakeholders
547 | 
548 | ### Reinventing the wheel
549 | - Too many solutions to solve one standard problem, with many
of them not being well thought out 550 | - More time and resource utilization for the engineering team leading overbudgeting and more time to market 551 | - A closed system architecture (useful for only one product), duplication of efforts, and poor risk management 552 | 553 | ### Vendor lock-in 554 | - Release cycles and product maintenance cycles of a company's product releases are directly dependent on the vendor's release time frame 555 | - The product is developed around the technology rather than on the customer's requirements 556 | - The product's time to market is unreliable and doesn't meet customer's expectations 557 | 558 | ### Design by committee 559 | - Conflicting viewpoints between developers and architects even after the design is finalized 560 | - Overly complex design that is very difficult to document 561 | - Any change in the specification or design undergoes review by many, resulting in implementation delays 562 | -------------------------------------------------------------------------------- /master-large-data-python/cover.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/master-large-data-python/cover.png -------------------------------------------------------------------------------- /master-large-data-python/map_reduce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/millengustavo/python-books/a2c810baa4e66399f778de5cd368771a2e17e4a2/master-large-data-python/map_reduce.png -------------------------------------------------------------------------------- /master-large-data-python/notes.md: -------------------------------------------------------------------------------- 1 | # Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code 2 | 3 | Authors: John T. Wolohan 4 | 5 | [Available here](https://www.manning.com/books/mastering-large-datasets-with-python) 6 | 7 | ![cover](cover.png) 8 | 9 | - [Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code](#mastering-large-datasets-with-python-parallelize-and-distribute-your-python-code) 10 | - [Ch1. Introduction](#ch1-introduction) 11 | - [Procedural programming](#procedural-programming) 12 | - [Parallel programming](#parallel-programming) 13 | - [The map function for transforming data](#the-map-function-for-transforming-data) 14 | - [The reduce function for advanced transformations](#the-reduce-function-for-advanced-transformations) 15 | - [Distributed computing for speed and scale](#distributed-computing-for-speed-and-scale) 16 | - [Hadoop: A distributed framework for map and reduce](#hadoop-a-distributed-framework-for-map-and-reduce) 17 | - [Spark for high-powered map, reduce, and more](#spark-for-high-powered-map-reduce-and-more) 18 | - [AWS Elastic MapReduce (EMR) - Large datasets in the cloud](#aws-elastic-mapreduce-emr---large-datasets-in-the-cloud) 19 | - [Ch2. 
Accelerating large dataset work: Map and parallel computing](#ch2-accelerating-large-dataset-work-map-and-parallel-computing) 20 | - [Pattern](#pattern) 21 | - [Lazy functions for large datasets](#lazy-functions-for-large-datasets) 22 | - [Parallel processing](#parallel-processing) 23 | - [Problems](#problems) 24 | - [Inability to pickle data or functions](#inability-to-pickle-data-or-functions) 25 | - [Order-sensitive operations](#order-sensitive-operations) 26 | - [State-dependent operations](#state-dependent-operations) 27 | - [Other observations](#other-observations) 28 | - [Ch3. Function pipelines for mapping complex transformations](#ch3-function-pipelines-for-mapping-complex-transformations) 29 | - [Helper functions and function chains](#helper-functions-and-function-chains) 30 | - [Creating a pipeline](#creating-a-pipeline) 31 | - [Compose](#compose) 32 | - [Pipe](#pipe) 33 | - [Summary](#summary) 34 | - [Ch4. Processing large datasets with lazy workflows](#ch4-processing-large-datasets-with-lazy-workflows) 35 | - [Laziness](#laziness) 36 | - [Shrinking sequences with the filter function](#shrinking-sequences-with-the-filter-function) 37 | - [Combining sequences with zip](#combining-sequences-with-zip) 38 | - [Lazy file searching with iglob](#lazy-file-searching-with-iglob) 39 | - [Understanding iterators: the magic behind lazy Python](#understanding-iterators-the-magic-behind-lazy-python) 40 | - [Generators: functions for creating data](#generators-functions-for-creating-data) 41 | - [Simulations](#simulations) 42 | - [Ch5. Accumulation operations with reduce](#ch5-accumulation-operations-with-reduce) 43 | - [Three parts of reduce](#three-parts-of-reduce) 44 | - [Accumulator functions](#accumulator-functions) 45 | - [Reductions](#reductions) 46 | - [Using map and reduce together](#using-map-and-reduce-together) 47 | - [Speeding up map and reduce](#speeding-up-map-and-reduce) 48 | - [Ch6. Speeding up map and reduce with advanced parallelization](#ch6-speeding-up-map-and-reduce-with-advanced-parallelization) 49 | - [Getting the most out of parallel map](#getting-the-most-out-of-parallel-map) 50 | - [More parallel maps: `.imap` and `starmap`](#more-parallel-maps-imap-and-starmap) 51 | - [`.imap`](#imap) 52 | - [`starmap`](#starmap) 53 | - [Parallel reduce for faster reductions](#parallel-reduce-for-faster-reductions) 54 | - [Ch7. Processing truly big datasets with Hadoop and Spark](#ch7-processing-truly-big-datasets-with-hadoop-and-spark) 55 | - [Distributed computing](#distributed-computing) 56 | - [Hadoop five modules](#hadoop-five-modules) 57 | - [YARN for job scheduling](#yarn-for-job-scheduling) 58 | - [The data storage backbone of Hadoop: HDFS](#the-data-storage-backbone-of-hadoop-hdfs) 59 | - [MapReduce jobs using Python and Hadoop Streaming](#mapreduce-jobs-using-python-and-hadoop-streaming) 60 | - [Spark for interactive workflows](#spark-for-interactive-workflows) 61 | - [PySpark for mixing Python and Spark](#pyspark-for-mixing-python-and-spark) 62 | - [Ch8. Best practices for large data with Apache Streaming and mrjob](#ch8-best-practices-for-large-data-with-apache-streaming-and-mrjob) 63 | - [Unstructured data: Logs and documents](#unstructured-data-logs-and-documents) 64 | - [JSON for passing data between mapper and reducer](#json-for-passing-data-between-mapper-and-reducer) 65 | - [mrjob for pythonic Hadoop streaming](#mrjob-for-pythonic-hadoop-streaming) 66 | - [Ch9. 
PageRank with map and reduce in PySpark](#ch9-pagerank-with-map-and-reduce-in-pyspark) 67 | - [Map-like methods in PySpark](#map-like-methods-in-pyspark) 68 | - [Reduce-like methods in PySpark](#reduce-like-methods-in-pyspark) 69 | - [Convenience methods in PySpark](#convenience-methods-in-pyspark) 70 | - [Saving RDDs to text files](#saving-rdds-to-text-files) 71 | - [Ch10. Faster decision-making with machine learning and PySpark](#ch10-faster-decision-making-with-machine-learning-and-pyspark) 72 | - [Organizing the data for learning](#organizing-the-data-for-learning) 73 | - [Auxiliary classes](#auxiliary-classes) 74 | - [Evaluation](#evaluation) 75 | - [Cross-validation in PySpark](#cross-validation-in-pyspark) 76 | - [Ch11. Large datasets in the cloud with Amazon Web Services and S3](#ch11-large-datasets-in-the-cloud-with-amazon-web-services-and-s3) 77 | - [Objects for convenient heterogenous storage](#objects-for-convenient-heterogenous-storage) 78 | - [Parquet: A concise tabular data store](#parquet-a-concise-tabular-data-store) 79 | - [Ch12. MapReduce in the cloud with Amazon's Elastic MapReduce](#ch12-mapreduce-in-the-cloud-with-amazons-elastic-mapreduce) 80 | - [Convenient cloud clusters with EMR](#convenient-cloud-clusters-with-emr) 81 | - [AWS EMR](#aws-emr) 82 | - [Starting EMR clusters with mrjob](#starting-emr-clusters-with-mrjob) 83 | - [Machine learning in the cloud with Spark on EMR](#machine-learning-in-the-cloud-with-spark-on-emr) 84 | - [Running machine learning algorithms on a truly large dataset](#running-machine-learning-algorithms-on-a-truly-large-dataset) 85 | - [EC2 instance types and clusters](#ec2-instance-types-and-clusters) 86 | - [Software available on EMR](#software-available-on-emr) 87 | 88 | # Ch1. Introduction 89 | Map and reduce style of programming: 90 | - easily write parallel programs 91 | - organize the code around two functions: `map` and `reduce` 92 | 93 | > `MapReduce` = framework for parallel and distributed computing; `map` and `reduce` = style of programming that allows running the work in parallel with minimal rewriting and extend the work to distributed workflows 94 | 95 | **Dask** -> another tool for managing large data without `map` and `reduce` 96 | 97 | ## Procedural programming 98 | Program Workflow 99 | 1. Starts to run 100 | 2. issues an instruction 101 | 3. instruction is executed 102 | 4. repeat 2 and 3 103 | 5. finishes running 104 | 105 | ## Parallel programming 106 | Program workflow 107 | 1. Starts to run 108 | 2. divides up the work into chunks of instructions and data 109 | 3. each chunk of work is executed independently 110 | 4. chunks of work are reassembled 111 | 5. 
finishes running
112 | 
113 | ![map_reduce](map_reduce.png)
114 | 
115 | > The `map` and `reduce` style is applicable everywhere, but its specific strengths are in areas where you may need to scale
116 | 
117 | ## The map function for transforming data
118 | - `map`: function to transform sequences of data from one type to another
119 | - Always retains the same number of objects in the output as were provided in the input
120 | - Performs one-to-one transformations -> is a great way to transform data so it is more suitable for use
121 | 
122 | > Declarative programming: focuses on explaining the logic of the code and not on specifying low-level details -> scaling is natural, the logic stays the same
123 | 
124 | ## The reduce function for advanced transformations
125 | - `reduce`: transform a sequence of data into a data structure of any shape or size
126 | - MapReduce programming pattern relies on the `map` function to transform some data into another type of data and then uses the `reduce` function to combine that data
127 | - Performs one-to-any transformations -> is a great way to assemble data into a final result
128 | 
129 | ## Distributed computing for speed and scale
130 | Extension of parallel computing in which the computer resource we are dedicating to work on each chunk of a given task is its own machine
131 | 
132 | ## Hadoop: A distributed framework for map and reduce
133 | - Designed as an open source implementation of Google's original MapReduce framework
134 | - Evolved into distributed computing software used widely by companies processing large amounts of data
135 | 
136 | ## Spark for high-powered map, reduce, and more
137 | - Something of a successor to the Apache Hadoop framework that does more of its work in memory instead of by writing to file
138 | - Can run more than 100x faster than Hadoop
139 | 
140 | ## AWS Elastic MapReduce (EMR) - Large datasets in the cloud
141 | - Popular way to implement Hadoop and Spark
142 | - Tackle small problems with parallel programming, as it's cost-effective
143 | - Tackle large problems with parallel programming, because we can procure as many resources as we need
144 | 
145 | # Ch2.
Accelerating large dataset work: Map and parallel computing
146 | `map`'s primary capabilities:
147 | - Replace `for` loops
148 | - Transform data
149 | - `map` evaluates only when necessary, not when called -> generic `map` object as output
150 | 
151 | `map` makes it easy to parallelize code -> break the work into pieces
152 | 
153 | ## Pattern
154 | - Take a sequence of data
155 | - Transform it with a function
156 | - Get the outputs
157 | 
158 | > Using `generators` instead of normal loops prevents storing all objects in memory in advance
159 | 
160 | ## Lazy functions for large datasets
161 | - `map` = lazy function = it doesn't evaluate when we call `map`
162 | - Python stores the instructions for evaluating the function and runs them at the exact moment we ask for the value
163 | - Common lazy objects in Python = `range` function
164 | - Lazy `map` allows us to transform a lot of data without an unnecessarily large amount of memory or spending the time to generate it
165 | 
166 | ## Parallel processing
167 | ### Problems
168 | #### Inability to pickle data or functions
169 | - *Pickling*: Python's version of object serialization or marshalling
170 | - Storing objects from our code in an efficient binary format on the disk that can be read back by our program at a later time (`pickle` module)
171 | - Allows us to share data across processors or even machines, saving the instructions and data and then executing them elsewhere
172 | - Objects we can't pickle: lambda functions, nested functions, nested classes
173 | - The `pathos` and `dill` modules allow us to pickle almost anything
174 | 
175 | #### Order-sensitive operations
176 | - Work in parallel: not guaranteed that tasks will be finished in the same order they're input
177 | - If work needs to be processed in a linear order -> probably shouldn't do it in parallel
178 | - Even though Python may not complete the problems in order, it still remembers the order in which it was supposed to do them -> `map` returns in the exact order we would expect, even if it doesn't process in that order
179 | 
180 | #### State-dependent operations
181 | - Common solution for the state problem: **take the internal state and make it an external variable**
182 | 
183 | ## Other observations
184 | - Best way to flatten a list into one big list -> Python's itertools `chain` function: takes an iterable of iterables and chains them together so they can all be accessed one after another -> lazy by default
185 | - Best way to visualize graphs is to take it out of Python and import it into Gephi: dedicated piece of graph visualization software
186 | 
187 | > Anytime we're converting a sequence of some type into a sequence of another type, what we're doing can be expressed as a map -> N-to-N transformation: we're converting N data elements into N data elements, but in a different format
188 | 
189 | - Making this type of problem parallel only adds a few lines of code (see the sketch below):
190 |   - one import
191 |   - wrangling our processors with `Pool()`
192 |   - modifying our `map` statements to use the `Pool.map` method
193 | 
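A minimal sketch of that change using the standard library's `multiprocessing`; the squaring function is illustrative, not from the book:

```
from multiprocessing import Pool


def square(x):
    return x * x


if __name__ == "__main__":
    numbers = range(10)
    lazy_result = map(square, numbers)        # lazy, single-core map
    with Pool() as pool:                      # wrangle the processors
        parallel_result = pool.map(square, numbers)  # same call, now parallel
    print(list(lazy_result), parallel_result)
```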
194 | # Ch3. Function pipelines for mapping complex transformations
195 | 
196 | ## Helper functions and function chains
197 | **Helper functions**: small, simple functions that we rely on to do complex things -> break down large problems into small pieces that we can code quickly
198 | 
199 | **Function chains** or **pipelines**: the way we put helper functions to work
200 | 
201 | ### Creating a pipeline
202 | - Chaining helper functions together
203 | - Ways to do this:
204 |   - Using a sequence of maps
205 |   - Chaining functions together with `compose`
206 |   - Creating a function pipeline with `pipe`
207 | - `compose` and `pipe` are functions in the `toolz` package
208 | 
209 | #### Compose
210 | ```
211 | from toolz.functoolz import compose
212 | ```
213 | 
214 | - Pass `compose` all the functions we want to include in our pipeline
215 | - Pass them in **reverse order** because `compose` is going to apply them backwards
216 | - Store the output of our `compose` function, which is itself a function, to a variable
217 | - Call that variable or pass it along to `map`
218 | 
219 | #### Pipe
220 | ```
221 | from toolz.functoolz import pipe
222 | ```
223 | 
224 | - The `pipe` function will pass a value through a pipeline
225 | - `pipe` expects the functions to be in the order we want to apply them
226 | - `pipe` evaluates each of the functions and returns a result
227 | - If we want to pass it to `map`, we have to wrap it in a function definition
228 | 
229 | ## Summary
230 | > Major advantages of creating pipelines of helper functions are that the code becomes: **Readable and clear; Modular and easy to edit**
231 | 
232 | - Modular code plays very nicely with `map` and can readily move into parallel workflows, such as by using the `Pool()`
233 | - We can simplify working with nested data structures by using nested function pipelines, which we can apply with `map`
234 | 
235 | # Ch4. Processing large datasets with lazy workflows
236 | ## Laziness
237 | - *Lazy evaluation*: a strategy for deciding when to perform computations
238 | - Under lazy evaluation, the Python interpreter executes lazy Python code only when the program needs the results of that code
239 | - Opposite of *eager evaluation*, where everything is evaluated when it's called
240 | 
241 | ### Shrinking sequences with the filter function
242 | - `filter`: function for pruning sequences.
243 | - Takes a sequence and restricts it to only the elements that meet a given condition
244 | - Related functions to know:
245 |   - `itertools.filterfalse`: get all the results that make a qualifier function return `False`
246 |   - `toolz.dicttoolz.keyfilter`: filter on the keys of a `dict`
247 |   - `toolz.dicttoolz.valfilter`: filter on the values of a `dict`
248 |   - `toolz.dicttoolz.itemfilter`: filter on both the keys and the values of a `dict`
249 | 
250 | ### Combining sequences with zip
251 | - `zip`: function for merging sequences.
252 | - Takes two sequences and returns a single sequence of `tuples`, each of which contains an element from each of the original sequences
253 | - Behaves like a zipper: it interlocks the values of Python iterables
254 | 
255 | ### Lazy file searching with iglob
256 | - `iglob`: function for lazily reading from the filesystem.
257 | - Lazy way of querying our filesystem 258 | - Find a sequence of files on our filesystem that match a given pattern 259 | 260 | ``` 261 | from glob import iglob 262 | posts = iglob("path/to/posts/2020/06/*.md") 263 | ``` 264 | 265 | ## Understanding iterators: the magic behind lazy Python 266 | - Replace data with instructions about where to find data and replace transformations with instructions for how to execute those transformations. 267 | - The computer only has to concern itself with the data it is processing right now, as opposed to the data it just processed or has to process in the future 268 | - Iterators are the base class of all the Python data types that can be iterated over 269 | 270 | > The iteration process is defined by a special method called `.__iter__()`. If a class has this method and returns an object with a `.__next__()` method, then we can iterate over it. 271 | 272 | - One-way streets: once we call `next`, the item returned is removed from the sequence. We can never back up or retrieve that item again 273 | - Not meant for by-hand inspection -> meant for processing big data 274 | 275 | ## Generators: functions for creating data 276 | - Class of functions in Python that lazily produce values in a sequence 277 | - We can create generators with functions using `yield` statements or through concise and powerful list comprehension-like generator expressions 278 | - They're a simple way of implementing an iterator 279 | - Primary advantage of generators and lazy functions: **avoiding storing more in memory than we need to** 280 | - `itertools.islice`: take chunks from a sequence 281 | 282 | > Lazy functions are great at processing data, but hardware still limits how quickly we can work through it 283 | 284 | - `toolz.frequencies`: takes a sequence in and returns a `dict` of items that occurred in the sequence as keys with corresponding values equal to the number of times they occurred -> provides the frequencies of items in our sequence 285 | 286 | ## Simulations 287 | - For simulations -> writing classes allow us to consolidate the data about each piece of the simulation 288 | - `itertools.count()`: returns a generator that produces an infinite sequence of increasing numbers 289 | - Unzipping = the opposite of zipping -> takes a single sequence and returns two -> unzip = `zip(*my_sequence)` 290 | 291 | > `operator.methodcaller`: takes a string and returns a function that calls that method with the name of that string on any object passed to it -> call class methods using functions is helpful = allows us to use functions like `map` and `filter` on them 292 | 293 | # Ch5. 
293 | # Ch5. Accumulation operations with reduce
294 | - `reduce`: function for N-to-X transformations
295 | - We have a sequence and want to transform it into something that we can't use `map` for
296 | - `map` can take care of the transformations in a very concise manner, whereas `reduce` can take care of the very final transformation
297 | 
298 | ## Three parts of reduce
299 | - **Accumulator function**
300 | - **Sequence**: object that we can iterate through, such as lists, strings, and generators
301 | - **Initializer**: initial value to be passed to our accumulator (may be *optional*) -> use an initializer not when we want to change the value of our data, but when we want to change the *type* of the data
302 | 
303 | ```
304 | from functools import reduce
305 | 
306 | reduce(acc_fn, sequence, initializer)
307 | ```
308 | 
309 | ### Accumulator functions
310 | - Does the heavy lifting for `reduce`
311 | - Special type of helper function
312 | - Common prototype:
313 | - take an accumulated value and the next element in the sequence
314 | - return another object, typically of the same type as the accumulated value
315 | - **accumulator functions always need to return a value**
316 | - Accumulator functions take two variables: one for the accumulated data (often designated as acc, left, or a), and one for the next element in the sequence (designated nxt, right, or b).
317 | 
318 | ```
319 | def my_add(acc, nxt):
320 |     return acc + nxt
321 | 
322 | # or, using lambda functions
323 | lambda acc, nxt: acc + nxt
324 | ```
325 | 
326 | ## Reductions
327 | - `filter`
328 | - `frequencies`
329 | 
330 | ## Using map and reduce together
331 | > If you can decompose a problem into an N-to-X transformation, all that stands between you and a reduction that solves that problem is a well-crafted accumulation function
332 | 
333 | - Using the `map` and `reduce` pattern to decouple the transformation logic from the actual transformation itself:
334 | - leads to highly reusable code
335 | - with large datasets -> keeping functions simple becomes paramount -> we may have to wait a long time to discover we made a small error
336 | 
337 | ## Speeding up map and reduce
338 | > Using a parallel map can counterintuitively be slower than using a lazy map in map and reduce scenarios
339 | 
340 | - We can always use parallelization at the `reduce` level instead of at the `map` level
341 | 
342 | # Ch6. Speeding up map and reduce with advanced parallelization
343 | - Parallel `reduce`: use parallelization in the accumulation process instead of the transformation process
344 | 
345 | ## Getting the most out of parallel map
346 | Parallel `map` will be slower than lazy `map` when:
347 | - we're going to iterate through the sequence a second time later in our workflow
348 | - size of the work done in each parallel instance is small compared to the overhead that parallelization imposes -> *chunksize*: size of the different pieces into which we break our tasks for parallel processing
349 | - Python makes *chunksize* available as an option -> vary it according to the task at hand (see the sketch below)
350 | 
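A minimal sketch of passing `chunksize` to a parallel map; the work function and the numbers are made up, and the best chunk size is something to measure rather than assume:

```
from multiprocessing import Pool

def score(n):
    # a deliberately tiny unit of work, so parallel overhead dominates
    return n * n

if __name__ == "__main__":
    numbers = range(1_000_000)
    with Pool() as pool:
        # larger chunks mean fewer, bigger messages between processes;
        # too large and some workers sit idle, so it is worth experimenting
        results = pool.map(score, numbers, chunksize=10_000)
    print(results[:5])   # [0, 1, 4, 9, 16]
```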
351 | ### More parallel maps: `.imap` and `starmap`
352 | #### `.imap`
353 | - `.imap`: for lazy parallel mapping
354 | - use the `.imap` method to work in parallel on very large sequences efficiently
355 | - Lazy and parallel? use the `.imap` and `.imap_unordered` methods of `Pool()` -> both methods return iterators instead of lists
356 | - `.imap_unordered`: behaves the same, except it doesn't necessarily put the sequence in the right order for our iterator
357 | 
358 | #### `starmap`
359 | - use `starmap` to work with complex iterables, especially those we're likely to create using the `zip` function -> handles functions that need more than a single parameter (`map`'s limitation)
360 | - `starmap` unpacks `tuples` as **positional parameters** to the function with which we're mapping
361 | - `itertools.starmap`: lazy function
362 | - `Pool().starmap`: parallel function
363 | 
364 | ## Parallel reduce for faster reductions
365 | Parallel `reduce` will:
366 | - break a problem into chunks
367 | - make no guarantees about order
368 | - need to pickle data
369 | - be finicky about stateful objects
370 | - run slower than its linear counterpart on small datasets
371 | - run faster than its linear counterpart on big datasets
372 | - require an accumulator function, some data, and an initial value
373 | - perform N-to-X transformations
374 | 
375 | > Parallel reduce has six parameters: an accumulation function, a sequence, an initializer value, a map, a chunksize, and a combination function - three more than the standard reduce function
376 | 
377 | Parallel `reduce` workflow:
378 | - break our problem into pieces
379 | - do some work
380 | - combine the work
381 | - return a result
382 | 
383 | > With parallel `reduce` we trade the simplicity of always having the same combination function for the flexibility of more possible transformations
384 | 
385 | Implementing parallel `reduce`:
386 | 1. Importing the proper classes and functions
387 | 2. Rounding up some processors
388 | 3. Passing our `reduce` function the right helper functions and variables
389 | 
390 | - Python doesn't natively support parallel `reduce` -> `pathos` library
391 | - `toolz.fold` -> parallel `reduce` implementation (sketched below)
392 | 
393 | > `toolz` library: functional utility library that Python never came with. High-performance version of the library = `CyToolz`
394 | 
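A minimal sketch of those three steps, assuming `toolz`'s `fold` (which lives in `toolz.sandbox.parallel`) and a `pathos` `ProcessingPool` for the parallel map; the summing task and the chunk size are made up for illustration:

```
from toolz.sandbox.parallel import fold
from pathos.multiprocessing import ProcessingPool

def my_add(acc, nxt):
    # accumulator: combines the running total with the next element
    return acc + nxt

if __name__ == "__main__":
    numbers = range(1_000_000)
    pool = ProcessingPool()        # "rounding up some processors"
    total = fold(
        my_add,                    # accumulation function
        numbers,                   # sequence
        0,                         # initializer
        map=pool.imap,             # parallel, lazy map applied to each chunk
        chunksize=50_000,          # how the sequence is split across workers
        combine=my_add,            # how the per-chunk results are merged
    )
    print(total)                   # 499999500000
```

Here the combination function happens to be the same as the accumulator, but for type-changing reductions the two usually differ.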
395 | # Ch7. Processing truly big datasets with Hadoop and Spark
396 | - **Hadoop**: set of tools that support distributed map and reduce style of programming through Hadoop MapReduce
397 | - **Spark**: analytics toolkit designed to modernize Hadoop
398 | 
399 | ## Distributed computing
400 | - share tasks and data long-term across a network of computers
401 | - offers large benefits in speed when we can parallelize our work
402 | - challenges:
403 | - keeping track of all our data
404 | - coordinating our work
405 | 
406 | > If we distribute our work prematurely, we’ll end up losing performance by spending too much time talking between computers and processors. A lot of performance improvements at the high-performance limits of distributed computing revolve around **optimizing communication between machines**
407 | 
408 | ## Hadoop five modules
409 | 1. *MapReduce*: way of dividing work into parallelizable chunks
410 | 2. *YARN*: scheduler and resource manager
411 | 3. *HDFS*: file system for Hadoop
412 | 4. *Ozone*: Hadoop extension for object storage and semantic computing
413 | 5. *Common*: set of utilities that are shared across the previous four modules
414 | 
415 | ### YARN for job scheduling
416 | - Scheduling
417 | - Oversees all of the work that is being done
418 | - Acts as a final decision maker in terms of how resources should be allocated across the cluster
419 | - Application management (*node managers*): work at the node (single-machine) level to determine how resources should be allocated within that machine
420 | - *federation*: ties together resource managers in extremely high-demand use cases where thousands of nodes are not sufficient
421 | 
422 | ### The data storage backbone of Hadoop: HDFS
423 | Hadoop Distributed File System (HDFS) -> reliable, performant foundation for high-performance distributed computing (but with that comes complexity). Use cases:
424 | - process big datasets
425 | - be flexible in hardware choice
426 | - be protected against hardware failure
427 | 
428 | > Moving code is faster than moving data
429 | 
430 | ### MapReduce jobs using Python and Hadoop Streaming
431 | Hadoop MapReduce with Python -> Hadoop Streaming = utility for using Hadoop MapReduce with programming languages besides Java
432 | 
433 | Hadoop natively supports compressed data formats: .gz, .bz2, and .snappy
434 | 
435 | ## Spark for interactive workflows
436 | Analytics-oriented data processing framework designed to take advantage of higher-RAM compute clusters. Advantages for Python programmers:
437 | - direct Python interface - `PySpark`: allows us to interactively explore big data through a PySpark shell REPL
438 | - can query SQL databases directly (Java Database Connectivity - JDBC)
439 | - has a *DataFrame* API: rows-and-columns data structure familiar to `pandas` -> provides a convenience layer on top of the core Spark data object: the RDD (Resilient Distributed Dataset)
440 | - Spark has two high-performance data structures: RDDs, which are excellent for any type of data, and DataFrames, which are optimized for tabular data.
441 | 
442 | Favor Spark over Hadoop when:
443 | - processing streaming data
444 | - need to get the task completed nearly instantaneously
445 | - willing to pay for high-RAM compute clusters
446 | 
447 | ### PySpark for mixing Python and Spark
448 | PySpark: we can call Spark's Scala methods through Python just like we would a normal Python library
449 | 
450 | # Ch8. Best practices for large data with Apache Streaming and mrjob
451 | Use Hadoop to process:
452 | - lots of data fast: distributed parallelization
453 | - data that's important: low data loss
454 | - enormous amounts of data: petabyte scale
455 | 
456 | Drawbacks:
457 | - To use Hadoop with Python -> must go through the Hadoop Streaming utility
458 | - Repeatedly reading strings in from `stdin`
459 | - Error messages come from Java and are not helpful
460 | 
461 | ## Unstructured data: Logs and documents
462 | - Hadoop creators designed Hadoop to work on *unstructured data* -> data in the form of documents
463 | - Unstructured data is notoriously unwieldy =/= tabular data
464 | - But it is one of the most common forms of data around
465 | 
466 | ## JSON for passing data between mapper and reducer
467 | - JavaScript Object Notation (JSON)
468 | - Data format used for moving data in plain text between one place and another
469 | - Use the `json.dumps()` and `json.loads()` functions from Python's `json` library to achieve the transfer
470 | - Advantages:
471 | - easy for humans and machines to read
472 | - provides a number of useful basic data types (string, numeric, array)
473 | - emphasis on key-value pairs that aids the loose coupling of systems
474 | 
475 | ## mrjob for pythonic Hadoop streaming
476 | - `mrjob`: Python library for Hadoop Streaming that focuses on cloud compatibility for truly scalable analysis
477 | - keeps the mapper and reducer steps but wraps them up in a single worker class, `MRJob`
478 | - `mrjob` versions of `map` and `reduce` share the same type signature, taking in keys and values and outputting keys and values
479 | - `mrjob` enforces JSON data exchange between the mapper and reducer phases, so we need to ensure that our output data is JSON serializable.
480 | 
481 | # Ch9. PageRank with map and reduce in PySpark
482 | PySpark's RDD class methods:
483 | - `map`-like methods: replicate the function of `map`
484 | - `reduce`-like methods: replicate the function of `reduce`
485 | - *Convenience methods*: solve common problems
486 | 
487 | > **Partitions** are the abstraction that RDDs use to implement parallelization. The data in an RDD is split up across different partitions, and each partition is handled in memory. It is common in large data tasks to partition an RDD by a key
488 | 
489 | ### Map-like methods in PySpark
490 | - `.map`
491 | - `.flatMap`
492 | - `.mapValues`
493 | - `.flatMapValues`
494 | - `.mapPartitions`
495 | - `.mapPartitionsWithIndex`
496 | 
497 | ### Reduce-like methods in PySpark
498 | - `.reduce`
499 | - `.fold`
500 | - `.aggregate` -> provides all the functionality of a parallel reduce. We can provide an initializer value, an aggregation function, and a combination function
501 | 
502 | ### Convenience methods in PySpark
503 | Many of these mirror functions in `functools`, `itertools`, and `toolz`. Some examples:
504 | - .countByKey()
505 | - .countByValue()
506 | - .distinct()
507 | - .countApproxDistinct()
508 | - .filter()
509 | - .first()
510 | - .groupBy()
511 | - .groupByKey()
512 | - .saveAsTextFile()
513 | - .take()
514 | 
515 | #### Saving RDDs to text files
516 | Excellent for a few reasons (a short sketch follows this list):
517 | - The data is in a human-readable, persistent format.
518 | - We can easily read this data back into Spark with the `.textFile` method of `SparkContext`.
519 | - The data is well structured for other parallel tools, such as Hadoop’s MapReduce.
520 | - We can specify a compression format for efficient data storage or transfer.
521 | 
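A minimal sketch of a few of these RDD methods running on a local `SparkContext`; the numbers and the output path are made up:

```
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

numbers = sc.parallelize(range(1, 6), numSlices=2)   # RDD split across 2 partitions
squares = numbers.map(lambda n: n * n)               # map-like: transform every element
total = squares.reduce(lambda a, b: a + b)           # reduce-like: combine to one value
print(total)                                         # 55

# convenience method: writes one text file part per partition
squares.saveAsTextFile("path/to/squares_output")

sc.stop()
```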
522 | > The RDD `.aggregate` method returns a plain Python `dict`, not an RDD. We need an RDD so that we can take advantage of Spark’s parallelization. To get an RDD, we’ll need to explicitly convert the items of that dict into an RDD using the `.parallelize` method from our SparkContext: `sc`.
523 | 
524 | - Spark programs often use \ characters in their method chaining to increase their readability
525 | - Using the `byKey` variations of methods in PySpark often results in significant speed-ups because like data is worked on by the same distributed compute worker
526 | 
527 | # Ch10. Faster decision-making with machine learning and PySpark
528 | One of the reasons why Spark is so popular = built-in machine learning capabilities
529 | 
530 | PySpark’s machine learning capabilities live in a package called `ml`. This package itself contains a few different modules categorizing some of the core machine learning capabilities, including:
531 | - `pyspark.ml.feature` — For feature transformation and creation
532 | - `pyspark.ml.classification` — Algorithms for judging the category in which a data point belongs
533 | - `pyspark.ml.tuning` — Algorithms for improving our machine learners
534 | - `pyspark.ml.evaluation` — Algorithms for evaluating machine learners
535 | - `pyspark.ml.util` — Methods of saving and loading machine learners
536 | 
537 | > PySpark’s machine learning features expect us to have our data in a PySpark `DataFrame` object - not an `RDD`. The `RDD` is an abstract parallelizable data structure at the core of Spark, whereas the `DataFrame` is a layer on top of the `RDD` that provides a notion of rows and columns
538 | 
539 | ## Organizing the data for learning
540 | Spark's ml classifiers look for two columns in a `DataFrame`:
541 | - A `label` column: indicates the correct classification of the data
542 | - A `features` column: contains the features we're going to use to predict that label
543 | 
544 | ## Auxiliary classes
545 | - PySpark's `StringIndexer`: transforms categorical data stored as category names (using strings) and indexes the names as numerical variables. `StringIndexer` indexes categories in order of frequency — from most common to least common. The most common category will be 0, the second most common category 1, and so on
546 | - Most data structures in Spark are immutable -> property of Scala (in which Spark is written)
547 | - Spark's ml classifiers want a single column named `features` -> PySpark's `VectorAssembler`: a `Transformer` like `StringIndexer` -> takes some input column names and an output column name and has methods to return a new `DataFrame` that has all the columns of the original, plus the new column we want to add
548 | - The feature creation classes are `Transformer`-class objects, and their methods return new `DataFrames`, rather than transforming them in place (see the sketch at the end of this chapter)
549 | 
550 | ## Evaluation
551 | PySpark's `ml.evaluation` module:
552 | - `BinaryClassificationEvaluator`
553 | - `RegressionEvaluator`
554 | - `MulticlassClassificationEvaluator`
555 | 
556 | ## Cross-validation in PySpark
557 | `CrossValidator` class: k-fold cross-validation, needs to be initialized with:
558 | - An *estimator*
559 | - A *parameter grid* - a `ParamGridBuilder` object
560 | - An *evaluator*
561 | 
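A minimal sketch of these pieces working together; the tiny DataFrame, the column names, and the choice of `LogisticRegression` are made up for illustration:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# made-up data: (category, amount, label)
df = spark.createDataFrame(
    [("red", 1.0, 0.0), ("blue", 3.0, 1.0), ("red", 2.5, 1.0), ("blue", 0.5, 0.0)],
    ["color", "amount", "label"],
)

# index the string category as a number (most frequent category -> 0)
indexed = StringIndexer(inputCol="color", outputCol="color_ix").fit(df).transform(df)

# collect the predictors into the single `features` column Spark expects
assembled = VectorAssembler(
    inputCols=["color_ix", "amount"], outputCol="features"
).transform(indexed)

model = LogisticRegression().fit(assembled)     # uses `features` and `label` by default
predictions = model.transform(assembled)

evaluator = BinaryClassificationEvaluator()     # defaults to area under the ROC curve
print(evaluator.evaluate(predictions))

spark.stop()
```

Note how every step returns a new `DataFrame` rather than mutating the old one, which is the immutability property mentioned above.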
562 | # Ch11. Large datasets in the cloud with Amazon Web Services and S3
563 | S3 is the go-to service for large datasets:
564 | 1. *effectively unlimited storage capacity*. We never have to worry about our dataset becoming too large
565 | 2. *cloud-based*. We can scale up and down quickly as necessary.
566 | 3. *offers object storage*. We can focus on organizing our data with metadata and store many different types of data.
567 | 4. *managed service*. Amazon Web Services takes care of a lot of the details for us, such as ensuring data availability and durability. They also take care of security patches and software updates.
568 | 5. *supports versioning and life cycle policies*. We can use them to update or archive our data as it ages
569 | 
570 | ## Objects for convenient heterogeneous storage
571 | - Object storage: storage pattern that focuses on the **what of the data instead of the where**
572 | - With object storage we recognize objects by a unique identifier (instead of the name and directory)
573 | - Supports arbitrary metadata -> we can tag our objects flexibly based on our needs (helps us find those objects later when we need to use them)
574 | - Querying tools are available for S3 that allow SQL-like querying on these metadata tags for metadata analysis
575 | - Unique identifiers -> we can store heterogeneous data in the same way
576 | 
577 | ## Parquet: A concise tabular data store
578 | - CSV is a simple, tabular data store, and JSON is a human-readable document store. Both are common in data interchange and are often used in the storage of distributed large datasets. Parquet is a Hadoop-native tabular data format.
579 | - Parquet uses clever metadata to improve the performance of map and reduce operations. Running a job on Parquet can take as little as 1/100th the time a comparable job on a CSV or JSON file would take. Additionally, Parquet supports efficient compression. As a result, it can be stored at a fraction of the cost of CSV or JSON.
580 | - These benefits make Parquet an excellent option for data that primarily needs to be read by a machine, such as for batch analytics operations. JSON and CSV remain good options for smaller data or data that's likely to need some human inspection.
581 | 
582 | > Boto is a library that provides Pythonic access to many of the AWS APIs. We need the access key and secret key to programmatically access AWS through boto
583 | 
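A minimal sketch of talking to S3 with `boto3` (the current incarnation of Boto); the bucket name, keys, and local path are made up, and the access key and secret key are assumed to be configured in the environment:

```
import boto3

s3 = boto3.client("s3")   # picks up the access key and secret key from the environment

# upload a local Parquet file as an object identified by its key
s3.upload_file("data/pages.parquet", "my-example-bucket", "pages/2020/06/pages.parquet")

# list objects under a prefix -- we find data by identifier, not by browsing directories
response = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="pages/2020/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```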
584 | # Ch12. MapReduce in the cloud with Amazon's Elastic MapReduce
585 | ## Convenient cloud clusters with EMR
586 | Ways to get access to a compute cluster that supports both Hadoop and Spark:
587 | - AWS: Amazon's Elastic MapReduce
588 | - Microsoft's Azure HDInsight
589 | - Google's Cloud Dataproc
590 | 
591 | ### AWS EMR
592 | - AWS EMR is a managed data cluster service
593 | - We specify general properties of the cluster, and AWS runs software that creates the cluster for us
594 | - When we're done using the cluster, AWS absorbs the compute resources back into its network
595 | - Pricing model is a per-compute-unit per-second charge
596 | - **There are no cost savings to doing things slowly. AWS encourages us to parallelize our problems away**
597 | 
598 | #### Starting EMR clusters with mrjob
599 | - We can run Hadoop jobs on EMR with the `mrjob` library, which allows us to write distributed MapReduce jobs and procure cluster computing in Python.
600 | - We can use `mrjob`'s configuration files to describe what we want our clusters to look like, including which instances we’d like to use, where we’d like those instances to be located, and any tags we may want to add.
601 | 
602 | > Hadoop on EMR is excellent for large data processing workloads, such as batch analytics or extract-transform-load (ETL)
603 | 
604 | ## Machine learning in the cloud with Spark on EMR
605 | - Hadoop is great for low-memory workloads and massive data.
606 | - Spark is great for jobs that are harder to break down into map and reduce steps, and situations where we can afford higher-memory machines
607 | 
608 | ### Running machine learning algorithms on a truly large dataset
609 | 1. Get a sample of the full dataset.
610 | 2. Train and evaluate a few models on that dataset.
611 | 3. Select some models to evaluate on the full dataset.
612 | 4. Train several models on the full dataset in the cloud.
613 | 
614 | > Run your Spark code with the `spark-submit` utility instead of plain `python`. The `spark-submit` utility queues up a Spark job, which will run in parallel locally and simulate what would happen if you ran the program on an active cluster
615 | 
616 | ### EC2 instance types and clusters
617 | - `M-series`: use for Hadoop and for testing Spark jobs
618 | - `C-series`: compute-heavy workloads such as Spark analytics and batch Spark jobs
619 | - `R-series`: high-memory, use for streaming analytics
620 | 
621 | ### Software available on EMR
622 | - JupyterHub: cluster-ready version of Jupyter Notebook -> run interactive Spark and Hadoop jobs from a notebook environment
623 | - Hive: compile SQL code to Hadoop MapReduce jobs
624 | - Pig: compile *Pig Latin* (SQL-like) commands to run Hadoop MapReduce jobs
625 | 
626 | 
--------------------------------------------------------------------------------