├── utils
│   ├── __init__.py
│   └── call_llm.py
├── requirements.txt
├── assets
│   └── banner.png
├── docs
│   ├── llm_dev
│   │   └── static
│   │       ├── meme.jpg
│   │       ├── design.png
│   │       ├── Octomind.png
│   │       ├── cookbook.png
│   │       ├── abstraction.png
│   │       ├── code2tutorial.png
│   │       ├── langchain_doc.png
│   │       ├── lanchain_diagram.png
│   │       ├── pocketflow_logo.png
│   │       ├── reddit_complaint.png
│   │       ├── node_illustration.png
│   │       ├── pocketflow_github.png
│   │       ├── star-history-2025710.png
│   │       ├── star-history-pocketflow.png
│   │       └── langchain_dependency_error.jpeg
│   ├── design.md
│   ├── other
│   │   └── fed.md
│   ├── llm
│   │   ├── lora.md
│   │   ├── adam.md
│   │   ├── attention.md
│   │   └── pretrain.md
│   ├── ml
│   │   └── linear.md
│   ├── stats
│   │   └── hypothesis.md
│   ├── rl
│   │   ├── multi-armed bandits.md
│   │   └── n-step-bootstrapping.md
│   └── math
│       └── eigen.md
├── README.md
├── main.py
├── flow.py
├── nodes.py
└── .gitignore
/utils/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pocketflow>=0.0.1
--------------------------------------------------------------------------------
/assets/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/assets/banner.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/meme.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/meme.jpg
--------------------------------------------------------------------------------
/docs/llm_dev/static/design.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/design.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | AI Tutor for Learning
2 |
3 | > We use AI to make education addictive
4 |
5 | This is an ongoing project...
6 |
--------------------------------------------------------------------------------
/docs/llm_dev/static/Octomind.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/Octomind.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/cookbook.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/cookbook.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/abstraction.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/abstraction.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/code2tutorial.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/code2tutorial.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/langchain_doc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/langchain_doc.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/lanchain_diagram.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/lanchain_diagram.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/pocketflow_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/pocketflow_logo.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/reddit_complaint.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/reddit_complaint.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/node_illustration.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/node_illustration.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/pocketflow_github.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/pocketflow_github.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/star-history-2025710.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/star-history-2025710.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/star-history-pocketflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/star-history-pocketflow.png
--------------------------------------------------------------------------------
/docs/llm_dev/static/langchain_dependency_error.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/The-Pocket/PocketFlow-Tutorial-Video-Generator/main/docs/llm_dev/static/langchain_dependency_error.jpeg
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | from flow import create_qa_flow
2 |
3 | # Example main function
4 | # Please replace this with your own main function
5 | def main():
6 | shared = {
7 | "question": "In one sentence, what's the end of universe?",
8 | "answer": None
9 | }
10 |
11 | qa_flow = create_qa_flow()
12 | qa_flow.run(shared)
13 | print("Question:", shared["question"])
14 | print("Answer:", shared["answer"])
15 |
16 | if __name__ == "__main__":
17 | main()
18 |
--------------------------------------------------------------------------------
/flow.py:
--------------------------------------------------------------------------------
1 | from pocketflow import Flow
2 | from nodes import GetQuestionNode, AnswerNode
3 |
4 | def create_qa_flow():
5 | """Create and return a question-answering flow."""
6 | # Create nodes
7 | get_question_node = GetQuestionNode()
8 | answer_node = AnswerNode()
9 |
10 | # Connect nodes in sequence
11 | get_question_node >> answer_node
12 |
13 | # Create flow starting with input node
14 | return Flow(start=get_question_node)
15 |
16 | qa_flow = create_qa_flow()
--------------------------------------------------------------------------------
/utils/call_llm.py:
--------------------------------------------------------------------------------
1 | from openai import OpenAI
2 | import os
3 |
4 | # Learn more about calling the LLM: https://the-pocket.github.io/PocketFlow/utility_function/llm.html
5 | def call_llm(prompt):
6 | client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "your-api-key"))
7 | r = client.chat.completions.create(
8 | model="gpt-4o",
9 | messages=[{"role": "user", "content": prompt}]
10 | )
11 | return r.choices[0].message.content
12 |
13 | if __name__ == "__main__":
14 | prompt = "What is the meaning of life?"
15 | print(call_llm(prompt))
16 |
--------------------------------------------------------------------------------
/nodes.py:
--------------------------------------------------------------------------------
1 | from pocketflow import Node
2 | from utils.call_llm import call_llm
3 |
4 | class GetQuestionNode(Node):
5 | def exec(self, _):
6 | # Get question directly from user input
7 | user_question = input("Enter your question: ")
8 | return user_question
9 |
10 | def post(self, shared, prep_res, exec_res):
11 | # Store the user's question
12 | shared["question"] = exec_res
13 | return "default" # Go to the next node
14 |
15 | class AnswerNode(Node):
16 | def prep(self, shared):
17 | # Read question from shared
18 | return shared["question"]
19 |
20 | def exec(self, question):
21 | # Call LLM to get the answer
22 | return call_llm(question)
23 |
24 | def post(self, shared, prep_res, exec_res):
25 | # Store the answer in shared
26 | shared["answer"] = exec_res
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Dependencies
2 | node_modules/
3 | vendor/
4 | .pnp/
5 | .pnp.js
6 |
7 | # Build outputs
8 | dist/
9 | build/
10 | out/
11 | *.pyc
12 | __pycache__/
13 |
14 | # Environment files
15 | .env
16 | .env.local
17 | .env.*.local
18 | .env.development
19 | .env.test
20 | .env.production
21 |
22 | # IDE - VSCode
23 | .vscode/*
24 | !.vscode/settings.json
25 | !.vscode/tasks.json
26 | !.vscode/launch.json
27 | !.vscode/extensions.json
28 |
29 | # IDE - JetBrains
30 | .idea/
31 | *.iml
32 | *.iws
33 | *.ipr
34 |
35 | # IDE - Eclipse
36 | .project
37 | .classpath
38 | .settings/
39 |
40 | # Logs
41 | logs/
42 | *.log
43 | npm-debug.log*
44 | yarn-debug.log*
45 | yarn-error.log*
46 |
47 | # Operating System
48 | .DS_Store
49 | Thumbs.db
50 | *.swp
51 | *.swo
52 |
53 | # Testing
54 | coverage/
55 | .nyc_output/
56 |
57 | # Temporary files
58 | *.tmp
59 | *.temp
60 | .cache/
61 |
62 | # Compiled files
63 | *.com
64 | *.class
65 | *.dll
66 | *.exe
67 | *.o
68 | *.so
69 |
70 | # Package files
71 | *.7z
72 | *.dmg
73 | *.gz
74 | *.iso
75 | *.jar
76 | *.rar
77 | *.tar
78 | *.zip
79 |
80 | # Database
81 | *.sqlite
82 | *.sqlite3
83 | *.db
84 |
85 | # Optional npm cache directory
86 | .npm
87 |
88 | # Optional eslint cache
89 | .eslintcache
90 |
91 | # Optional REPL history
92 | .node_repl_history
--------------------------------------------------------------------------------
/docs/design.md:
--------------------------------------------------------------------------------
1 | # Design Doc: Your Project Name
2 |
3 | > Please DON'T remove notes for AI
4 |
5 | ## Requirements
6 |
7 | > Notes for AI: Keep it simple and clear.
8 | > If the requirements are abstract, write concrete user stories
9 |
10 |
11 | ## Flow Design
12 |
13 | > Notes for AI:
14 | > 1. Consider the design patterns of agent, map-reduce, rag, and workflow. Apply them if they fit.
15 | > 2. Present a concise, high-level description of the workflow.
16 |
17 | ### Applicable Design Pattern:
18 |
19 | 1. Map the file summary into chunks, then reduce these chunks into a final summary.
20 | 2. Agentic file finder
21 | - *Context*: The entire summary of the file
22 | - *Action*: Find the file
23 |
24 | ### Flow high-level Design:
25 |
26 | 1. **First Node**: This node is for ...
27 | 2. **Second Node**: This node is for ...
28 | 3. **Third Node**: This node is for ...
29 |
30 | ```mermaid
31 | flowchart TD
32 | firstNode[First Node] --> secondNode[Second Node]
33 | secondNode --> thirdNode[Third Node]
34 | ```
35 | ## Utility Functions
36 |
37 | > Notes for AI:
38 | > 1. Understand the utility function definition thoroughly by reviewing the doc.
39 | > 2. Include only the necessary utility functions, based on nodes in the flow.
40 |
41 | 1. **Call LLM** (`utils/call_llm.py`)
42 | - *Input*: prompt (str)
43 | - *Output*: response (str)
44 | - Generally used by most nodes for LLM tasks
45 |
46 | 2. **Embedding** (`utils/get_embedding.py`)
47 | - *Input*: str
48 | - *Output*: a vector of 3072 floats
49 | - Used by the second node to embed text
50 |
51 | ## Node Design
52 |
53 | ### Shared Store
54 |
55 | > Notes for AI: Try to minimize data redundancy
56 |
57 | The shared store structure is organized as follows:
58 |
59 | ```python
60 | shared = {
61 | "key": "value"
62 | }
63 | ```
64 |
65 | ### Node Steps
66 |
67 | > Notes for AI: Carefully decide whether to use Batch/Async Node/Flow.
68 |
69 | 1. First Node
70 | - *Purpose*: Provide a short explanation of the node’s function
71 | - *Type*: Decide between Regular, Batch, or Async
72 | - *Steps*:
73 | - *prep*: Read "key" from the shared store
74 | - *exec*: Call the utility function
75 | - *post*: Write "key" to the shared store
76 |
77 | 2. Second Node
78 | ...
79 |
80 |
--------------------------------------------------------------------------------
/docs/other/fed.md:
--------------------------------------------------------------------------------
1 | # The Physics of Money: Valuation, Employment, and the Federal Funds Rate ($r$)
2 |
3 | ## Introduction: The Algorithm of Value
4 |
5 | Why does a 0.25% change in a generic bank rate cause the stock market to crash?
6 | Why does this same number determine if you can buy a house or if your company cancels your yearly bonus?
7 | Why is the entire global economy sensitive to a single variable?
8 |
9 | The Federal Funds Rate ($r$) is the "gravity" of the financial world. It is the input variable that determines the weight of every dollar, every debt, and every asset on Earth.
10 |
11 | ### The Promise
12 | By the end of this tutorial (approx. 40 minutes), you will understand the mechanical relationship between the Federal Reserve and your wallet. You will not just learn "high interest is bad for borrowing." You will learn the specific valuation algorithms used by Wall Street and Corporations to price the world.
13 |
14 | You will master these formal concepts:
15 |
16 | * **The Master Formula:** $PV = \frac{CF}{(1+r)^n}$
17 | * **Discounted Cash Flow (DCF):** The math behind stock prices.
18 | * **Net Present Value (NPV):** The logic behind hiring and firing.
19 | * **Opportunity Cost:** The mathematical floor for investment.
20 |
21 | ### The Syllabus: From Math to Reality
22 |
23 | We will break down the economy into four variables.
24 |
25 | 1. **Your Debt (Leverage):**
26 | * **Concept:** Amortization mechanics.
27 | * **Reality:** Why a 4% rate increase doubles your monthly interest.
28 | 2. **Your Assets (Valuation):**
29 | * **Concept:** Discounted Cash Flow ($PV = \frac{CF}{(1+r)}$).
30 | * **Reality:** Why "safe" money kills "risky" tech stocks.
31 | 3. **Your Job (Hurdle Rates):**
32 | * **Concept:** Net Present Value (NPV).
33 | * **Reality:** Why companies stop hiring even if they are profitable.
34 | 4. **Your Cash (The Thermostat):**
35 | * **Concept:** Monetary Transmission Mechanism.
36 | * **Reality:** How the Fed intentionally slows the economy to fight inflation.
37 |
38 | **The core thesis:** The interest rate ($r$) is the denominator in the equation of value. As $r$ increases, the value of everything else decreases.
39 |
40 | Let’s begin with the most direct impact: Debt.
41 |
42 | ## Section 1: Your Monthly Budget (The Price of Leverage)
43 |
44 | The most immediate impact of the Federal Funds Rate ($r$) is on the price of **Leverage**. Most individuals and companies do not buy assets with cash; they buy them with debt.
45 |
46 | ### The Problem: The Monthly Payment Cap
47 | You want to buy a house. You take out a **\$300,000** mortgage.
48 | Most people think the price of the house is \$300,000. It is not. The price is the **monthly payment**.
49 | You have a budget limit of what you can pay per month. Let's see how $r$ breaks this limit.
50 |
51 | ### Concrete Solution: 3% vs 7%
52 | Let’s compare the interest costs in the first year.
53 |
54 | **Case A: The Low-Rate World ($r = 3\%$)**
55 | $$300,000 \times 0.03 = \$9,000 \text{ per year}$$
56 | **Monthly Interest cost:** $\approx \$750$
57 |
58 | **Case B: The High-Rate World ($r = 7\%$)**
59 | $$300,000 \times 0.07 = \$21,000 \text{ per year}$$
60 | **Monthly Interest cost:** $\approx \$1,750$
61 |
62 | **The Result:**
63 | The difference is **\$1,000 per month**. You get the exact same house, the exact same bricks, and the exact same land. But the cost of "renting the money" to buy it has more than doubled.
64 |
65 | ### Abstraction: Debt Service
66 | This concept is called **Debt Service**.
67 | When $r$ is low, money is cheap. You can service a large debt pile with a small income.
68 | When $r$ is high, money is expensive. Your buying power collapses.
69 |
70 | If your budget maxes out at \$2,000/month:
71 | * At 3%, you can afford the house easily.
72 | * At 7%, you are mathematically insolvent.
73 |
74 | ### Formalization: The Amortization Algorithm
75 | While the simple math above is a good approximation, banks use the **Amortization Formula** to calculate the exact fixed monthly payment ($M$).
76 |
77 | $$ M = P \frac{r(1+r)^n}{(1+r)^n - 1} $$
78 |
79 | **Variables:**
80 | * $M$: Total monthly payment
81 | * $P$: Principal loan amount (\$300,000)
82 | * $r$: Monthly interest rate (Annual rate / 12)
83 | * $n$: Number of payments (30 years = 360 months)
84 |
85 | **The Input/Output Table:**
86 |
87 | | Loan Amount ($P$) | Annual Rate ($r$) | Monthly Payment ($M$) | Total Interest Paid (30 yrs) |
88 | | :--- | :--- | :--- | :--- |
89 | | \$300,000 | **3.0%** | **\$1,265** | \$155,332 |
90 | | \$300,000 | **7.0%** | **\$1,996** | **\$418,527** |
91 |
92 | Notice the last column. At 7%, you pay more in interest (\$418k) than the house is actually worth (\$300k).
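
If you want to verify these figures yourself, here is a small Python sketch of the amortization formula (the helper name `monthly_payment` is just for illustration, not something from the tutorial):

```python
def monthly_payment(principal, annual_rate, years=30):
    """Fixed monthly payment: M = P * r(1+r)^n / ((1+r)^n - 1)."""
    r = annual_rate / 12            # monthly interest rate
    n = years * 12                  # number of monthly payments
    growth = (1 + r) ** n
    return principal * r * growth / (growth - 1)

for rate in (0.03, 0.07):
    m = monthly_payment(300_000, rate)
    interest = m * 360 - 300_000
    print(f"{rate:.0%}: payment ≈ ${m:,.0f}/month, total interest ≈ ${interest:,.0f}")
# 3%: payment ≈ $1,265/month, total interest ≈ $155,332
# 7%: payment ≈ $1,996/month, total interest ≈ $418,527
```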
93 |
94 | ### The Reality Check
95 | We just looked at one mortgage. Now scale this to the entire economy.
96 |
97 | When the Fed raises $r$:
98 | 1. **Auto Loans:** A \$40,000 car becomes unaffordable.
99 | 2. **Credit Cards:** Variable rates jump from 15% to 25%.
100 | 3. **Housing Freeze:** People with 3% mortgages refuse to sell because they can't afford a new mortgage at 7%. Supply vanishes.
101 |
102 | **Key Takeaway:** The Fed controls the price of entering the game. When $r$ goes up, fewer players can afford to play.
103 |
104 | ## Section 2: Your Stock Portfolio (Asset Gravity)
105 |
106 | Why does the stock market turn red the moment the Fed announces a rate hike? Even if the companies are still making money?
107 |
108 | It comes down to a single law of finance: **The value of an asset is the present value of its future cash flows.**
109 |
110 | ### The Problem: What is a stock worth?
111 | Imagine a theoretical company, "CashFlow Inc."
112 | It pays you a guaranteed dividend of **\$10 per year** forever.
113 | What should you pay for one share of this company today?
114 |
115 | The answer depends entirely on the Fed rate ($r$).
116 |
117 | ### Concrete Solution: The Valuation Shift
118 | Investors use the Fed rate as the "Risk-Free Rate"—the baseline they could earn just by sitting on cash. If they buy a stock, they demand a return *higher* than the bank rate.
119 |
120 | **Case A: The Low-Rate World ($r = 5\%$)**
121 | You want to earn \$10 a year. The bank only pays 5%.
122 | To get \$10 from the bank, you would need to deposit \$200 ($200 \times 0.05 = 10$).
123 | Therefore, "CashFlow Inc." is worth **\$200**.
124 |
125 | **Case B: The High-Rate World ($r = 10\%$)**
126 | The Fed raises rates. Now the bank pays 10%.
127 | To get \$10 from the bank, you only need to deposit \$100 ($100 \times 0.10 = 10$).
128 | Therefore, "CashFlow Inc." is now only worth **\$100**.
129 |
130 | **The Result:**
131 | The company did not change. It still makes \$10/year. But the *price* of the stock crashed by **50%** (from \$200 to \$100) purely because $r$ doubled.
132 |
133 | ### Abstraction: The Denominator Effect
134 | This logic applies to everything: real estate, bonds, and stocks.
135 | As interest rates rise, the "required return" rises. Because the cash flow (\$10) is fixed, the only thing that can move is the Price.
136 |
137 | **Rule:** Price and Rates move in opposite directions.
138 |
139 | ### Formalization: Discounted Cash Flow (DCF)
140 | Wall Street formalizes this using the DCF model. The most basic version (for a perpetuity) is:
141 |
142 | $$ PV = \frac{CF}{r} $$
143 |
144 | **Variables:**
145 | * $PV$: Present Value (The Stock Price)
146 | * $CF$: Cash Flow (The Dividend/Profit)
147 | * $r$: The Discount Rate (Fed Rate + Risk Premium)
148 |
149 | **Visualizing the Decay:**
150 | If we graph the Price ($PV$) as $r$ increases, it creates a steep downward curve.
151 |
152 | ```text
153 | Price ($)
154 | |
155 | | * ($200 @ 5%)
156 | |
157 | | *
158 | | * ($100 @ 10%)
159 | | *
160 | | * ($66 @ 15%)
161 | |____________________ Rate (r)
162 | ```
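
The curve above is easy to reproduce. Here is a minimal Python sketch of the perpetuity formula (illustrative only; `perpetuity_price` is not a name from the tutorial):

```python
def perpetuity_price(cash_flow, r):
    """PV = CF / r for a cash flow paid every year, forever."""
    return cash_flow / r

for r in (0.05, 0.10, 0.15):
    print(f"r = {r:.0%}: price ≈ ${perpetuity_price(10, r):,.0f}")
# r = 5%: price ≈ $200
# r = 10%: price ≈ $100
# r = 15%: price ≈ $67   (the chart above rounds this down to $66)
```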
163 |
164 | ### The Reality Check
165 | We just looked at a company paying \$10 *today*.
166 | Real life is more volatile.
167 |
168 | 1. **Tech Stocks (Growth):** Many tech companies (like startups) promise huge money *in the future* (10 years from now), not today.
169 | * When $r$ is high, money 10 years from now is nearly worthless today.
170 | * **Result:** This is why NASDAQ/Tech crashes harder than Coca-Cola when rates rise.
171 | 2. **Real Estate:** Investors look at rental income vs. the mortgage rate. If the mortgage rate ($r$) is higher than the rental yield ($CF$), the property value ($PV$) must come down to compensate.
172 |
173 | **Key Takeaway:** The Fed Rate is the denominator. If you increase the denominator, the resulting value **must** decrease. No exceptions.
174 |
175 | ## Section 3: Your Job Security (The Hurdle Rate)
176 |
177 | This is where the math gets personal. Why do companies announce layoffs or hiring freezes when interest rates go up—even if the company is profitable?
178 |
179 | It is because of **Opportunity Cost**. Companies do not invest in projects (or people) just to make a profit. They invest to make a profit *better than* what they could get for doing nothing.
180 |
181 | ### The Problem: Is this project worth it?
182 | Imagine you work for a company. You propose a new project (or a new team hire).
183 | * **Cost today:** \$100 (Salaries, Equipment)
184 | * **Payoff in 1 year:** \$110 (Revenue)
185 | * **Profit:** \$10.
186 |
187 | Is this a good idea? The answer is: **It depends on $r$.**
188 |
189 | ### Concrete Solution: The Safe Alternative
190 | The company has a choice: Invest \$100 in your project (risky) or put \$100 in "Risk-Free" Treasury Bonds (safe).
191 |
192 | **Scenario A: Fed Rate is Low ($r = 2\%$)**
193 | * **Safe Option:** \$100 grows to **\$102**.
194 | * **Your Project:** \$100 grows to **\$110**.
195 | * **Decision:** Your project beats the bank (\$110 > \$102).
196 | * **Result:** The company approves the budget. **You get hired.**
197 |
198 | **Scenario B: Fed Rate is High ($r = 12\%$)**
199 | * **Safe Option:** \$100 grows to **\$112**.
200 | * **Your Project:** \$100 grows to **\$110**.
201 | * **Decision:** Your project LOSES to the bank (\$110 < \$112). Even though your project makes a profit, it is *inefficient*.
202 | * **Result:** The company kills the project. **You get laid off.**
203 |
204 | ### Abstraction: The Hurdle Rate
205 | The Fed Rate acts as a **Hurdle Rate**.
206 | Think of it like a high jump bar.
207 | * When rates are 0%, the bar is on the floor. Almost any idea gets funded (Hello, Crypto startups and unprofitable Tech unicorns).
208 | * When rates are 5%, the bar is set high. Only the strongest, most profitable projects clear the jump. Everything else hits the bar and fails.
209 |
210 | ### Formalization: Net Present Value (NPV)
211 | Corporations formalize this logic using **Net Present Value**. This formula determines if an investment creates value or destroys it.
212 |
213 | $$ NPV = \frac{R}{(1+r)} - I $$
214 |
215 | **Variables:**
216 | * $I$: Initial Investment (\$100)
217 | * $R$: Return in one year (\$110)
218 | * $r$: The Interest Rate
219 |
220 | **The Calculation:**
221 | * **If $r = 2\%$:**
222 | $$ NPV = \frac{110}{1.02} - 100 \approx 107.8 - 100 = \mathbf{+\$7.80} $$
223 | **(Positive NPV = GO / HIRE)**
224 |
225 | * **If $r = 12\%$:**
226 | $$ NPV = \frac{110}{1.12} - 100 \approx 98.2 - 100 = \mathbf{-\$1.80} $$
227 | **(Negative NPV = NO GO / FIRE)**
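
The same decision rule, written as a few lines of Python (a sketch; the helper name `npv` is mine, not part of the tutorial):

```python
def npv(investment, payoff_next_year, r):
    """NPV = R / (1 + r) - I for a single payoff one year out."""
    return payoff_next_year / (1 + r) - investment

for r in (0.02, 0.12):
    value = npv(100, 110, r)
    verdict = "GO / HIRE" if value > 0 else "NO GO / FIRE"
    print(f"r = {r:.0%}: NPV ≈ {value:+.2f} -> {verdict}")
# r = 2%: NPV ≈ +7.84 -> GO / HIRE
# r = 12%: NPV ≈ -1.79 -> NO GO / FIRE
```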
228 |
229 | ### The Reality Check
230 | We just analyzed a \$100 project. Now look at the global economy.
231 |
232 | 1. **Venture Capital:** When rates are high, VC firms stop funding "growth at all costs" startups because the math (NPV) turns negative.
233 | 2. **Construction:** A developer won't build a new apartment complex if the loan interest is higher than the expected rent profit.
234 | 3. **The Result:** Less construction + less funding = fewer jobs.
235 |
236 | **Key Takeaway:** The Fed raises rates to intentionally make projects "fail" the NPV test. This forces companies to spend less and hire fewer people, cooling down the economy.
237 |
238 | ## Section 4: Your Cash & Inflation (The Thermostat)
239 |
240 | We have seen how $r$ destroys the value of stocks and makes debts expensive. Why does the Fed inflict this pain on purpose?
241 |
242 | They are acting as a **Thermostat**.
243 | * If the economy runs too hot (Inflation), they raise rates to cool it down.
244 | * If the economy freezes (Recession), they cut rates to heat it up.
245 |
246 | The mechanism works by changing your incentive to spend versus save.
247 |
248 | ### The Problem: The "Hot" Economy
249 | Inflation happens when there is "too much money chasing too few goods."
250 | To stop prices from rising, the Fed needs you to **stop spending money**.
251 | How do they convince you to stop spending? They pay you to wait.
252 |
253 | ### Concrete Solution: The Savings Incentive
254 | Imagine you have **\$1,000** in cash. You can buy a new TV today, or put it in a savings account for 10 years.
255 |
256 | **Case A: The Low-Rate World ($r = 1\%$)**
257 | * **Annual Interest:** $1,000 \times 0.01 = \$10$.
258 | * **10 Year Payoff:** You make roughly \$100 total.
259 | * **Your Decision:** A \$10 reward is pathetic. You buy the TV today.
260 | * **Macro Effect:** High spending $\rightarrow$ High demand $\rightarrow$ **High Inflation**.
261 |
262 | **Case B: The High-Rate World ($r = 5\%$)**
263 | * **Annual Interest:** $1,000 \times 0.05 = \$50$.
264 | * **10 Year Payoff:** Thanks to compound interest, this grows significantly.
265 | * **Your Decision:** You realize you can turn \$1,000 into roughly \$1,600 without working. You put the money in the bank. You **do not** buy the TV.
266 | * **Macro Effect:** Low spending $\rightarrow$ Low demand $\rightarrow$ **Inflation Drops**.
267 |
268 | ### Abstraction: Aggregate Demand
269 | This is the **Time Value of Money** in action.
270 | * **Low Rates:** Penalize savers. Encourage spending/speculation.
271 | * **High Rates:** Reward savers. Discourage spending.
272 |
273 | When millions of people simultaneously decide to "wait and save" instead of "buy now," stores have fewer customers. To get customers back, stores must lower prices. **That is how inflation is cured.**
274 |
275 | ### Formalization: Future Value (FV)
276 | We calculate the reward for waiting using the Compound Interest Formula:
277 |
278 | $$ FV = PV (1+r)^n $$
279 |
280 | **Variables:**
281 | * $FV$: Future Value (Money you have later)
282 | * $PV$: Present Value (Money you have now)
283 | * $r$: Interest Rate
284 | * $n$: Time (Years)
285 |
286 | **The Difference of 4%:**
287 | Over 10 years ($n=10$):
288 | * At 1%: $1,000(1.01)^{10} = \mathbf{\$1,104}$
289 | * At 5%: $1,000(1.05)^{10} = \mathbf{\$1,628}$
290 |
291 | That extra \$500 is the "bribe" the Fed pays you to remove your cash from the economy.
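
A quick Python check of those future values (an illustrative sketch, not part of the tutorial text):

```python
def future_value(pv, r, years):
    """FV = PV * (1 + r)^n."""
    return pv * (1 + r) ** years

low, high = future_value(1_000, 0.01, 10), future_value(1_000, 0.05, 10)
print(f"At 1%: ${low:,.2f} | At 5%: ${high:,.2f} | The Fed's 'bribe': ${high - low:,.2f}")
# At 1%: $1,104.62 | At 5%: $1,628.89 | The Fed's 'bribe': $524.27
```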
292 |
293 | ### The Mechanism: The Feedback Loop
294 | This brings all the previous sections together into one system. This is the **Monetary Transmission Mechanism**.
295 |
296 | ```mermaid
297 | graph TD
298 | A[Fed Raises Rates] --> B[Borrowing is Expensive]
299 | A --> C[Asset Prices Fall]
300 | A --> D[Saving is Rewarded]
301 |
302 | B --> E[Less Spending]
303 | C --> E
304 | D --> E
305 |
306 | E --> F[Companies Sell Less]
307 | F --> G[Companies Hire Less]
308 |
309 | G --> H[LOWER INFLATION]
310 | ```
311 |
312 | ### The Reality Check
313 | This mechanism is blunt. The Fed cannot target just "egg prices." They have to slow down the *entire* system.
314 |
315 | 1. **The "Soft Landing":** The goal is to raise rates *just enough* to stop inflation, but not so much that everyone loses their job (Section 3).
316 | 2. **The "Hard Landing" (Recession):** If they raise rates too high, debt becomes impossible to pay (Section 1), the stock market collapses (Section 2), and mass layoffs occur (Section 3).
317 |
318 | **Key Takeaway:** The Fed controls inflation by manipulating your psychology. They raise rates to make you feel poorer (lower stock prices) and make saving look smarter than spending. When you close your wallet, prices fall.
319 |
320 | ## Conclusion: The Universal Denominator
321 |
322 | By now, you should see that the Federal Funds Rate ($r$) is not just a number on a news ticker. It is the **Universal Denominator**.
323 |
324 | In math, when you increase the denominator of a fraction, the total value of the number gets smaller.
325 | $$ \text{Value} = \frac{\text{Cash Flow}}{\mathbf{r}} $$
326 |
327 | Because $r$ is in the denominator of the entire global economy, the Fed has the power to resize reality.
328 |
329 | ### Summary: The Matrix of Money
330 |
331 | Let's review the four levers we discussed. This is your cheat sheet for how the world works:
332 |
333 | | Area of Life | The Concept | When Rates ($r$) Go Up... | The Result |
334 | | :--- | :--- | :--- | :--- |
335 | | **Your Debt** | **Debt Service** | Payments explode (Math: Amortization) | You can afford less house/car. |
336 | | **Your Assets** | **DCF** | Future cash is worth less (Math: $PV = CF/r$) | Stock and Real Estate prices fall. |
337 | | **Your Job** | **NPV / Hurdle Rate** | Projects fail the profit test | Companies hire less or layoff staff. |
338 | | **Your Cash** | **Opportunity Cost** | Savings yield high returns | You stop spending, inflation falls. |
339 |
340 | Now you know: **There is no such thing as a fixed price.**
341 | A price is just a reflection of the current interest rate.
342 |
343 | * If you are waiting for house prices to drop, you are actually waiting for rates to rise (which makes the mortgage expensive).
344 | * If you are waiting for the stock market to moon, you are waiting for rates to fall (which creates inflation risk).
345 |
346 | You now possess the source code. The next time the Fed Chair walks up to the podium, you won't just hear a percentage change. You will see the valuation variables in your own life being rewritten in real-time.
--------------------------------------------------------------------------------
/docs/llm/lora.md:
--------------------------------------------------------------------------------
1 | # **Tutorial Outline: LoRA for LLMs From Scratch**
2 |
3 | ## **Chapter 1: The Promise - Master LoRA in 30 Minutes**
4 |
5 | You've heard of LoRA. It's the key to fine-tuning massive LLMs on a single GPU. You've seen the acronyms: PEFT, low-rank adaptation. But what is it, *really*?
6 |
7 | It's not a complex theory. It's a simple, elegant trick.
8 |
9 | Instead of training a 1-billion-parameter weight matrix `W`, you freeze it. You then train two tiny matrices, `A` and `B`, that represent the *change* to `W`.
10 |
11 | This is LoRA. It's this piece of code:
12 |
13 | ```python
14 | import torch
15 | import torch.nn as nn
16 | import torch.nn.functional as F
17 | import math
18 |
19 | class LoRALinear(nn.Module):
20 | def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
21 | super().__init__()
22 | self.r = r
23 | self.alpha = alpha
24 | self.scaling = self.alpha / self.r
25 |
26 | # Freeze the original linear layer
27 | self.base = base
28 | self.base.weight.requires_grad_(False)
29 |
30 | # Create the trainable low-rank matrices
31 | self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
32 | self.lora_B = nn.Parameter(torch.empty(base.out_features, r))
33 |
34 | # Initialize the weights
35 | nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
36 | nn.init.zeros_(self.lora_B) # Start with no change
37 |
38 | def forward(self, x: torch.Tensor) -> torch.Tensor:
39 | # Original path (frozen) + LoRA path (trainable)
40 | return self.base(x) + (F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling)
41 |
42 | ```
43 |
44 | **My promise:** You will understand every line of this code, the math behind it, and why it's so effective, in the next 30 minutes. Let's begin.
45 |
46 | ## **Chapter 2: The Foundation - The `nn.Linear` Layer**
47 |
48 | Before we can modify an LLM, we must understand its most fundamental part: the `nn.Linear` layer. It's the simple workhorse that performs the vast majority of computations in a Transformer.
49 |
50 | Its only job is to perform this equation: `output = input @ W.T + b`
51 |
52 | #### A Minimal, Reproducible Example
53 |
54 | Let's see this in action. We'll create a tiny linear layer that takes a vector of size 3 and outputs a vector of size 2. To make this perfectly clear, we will set the weights and bias manually.
55 |
56 | **1. Setup the layer and input:**
57 |
58 | ```python
59 | import torch
60 | import torch.nn as nn
61 |
62 | # A layer that maps from 3 features to 2 features
63 | layer = nn.Linear(in_features=3, out_features=2, bias=True)
64 |
65 | # A single input vector (with a batch dimension of 1)
66 | input_tensor = torch.tensor([[1., 2., 3.]])
67 |
68 | # Manually set the weights and bias for a clear example
69 | with torch.no_grad():
70 | layer.weight = nn.Parameter(torch.tensor([[0.1, 0.2, 0.3],
71 | [0.4, 0.5, 0.6]]))
72 | layer.bias = nn.Parameter(torch.tensor([0.7, 0.8]))
73 |
74 | ```
75 |
76 | **2. Inspect the Exact Components:**
77 |
78 | Now we have known values for everything.
79 |
80 | * **Input `x`:** `[1., 2., 3.]`
81 | * **Weight `W`:** `[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]`
82 | * **Bias `b`:** `[0.7, 0.8]`
83 |
84 | **3. The Forward Pass and Its Output:**
85 |
86 | When you call `layer(input_tensor)`, PyTorch computes the result.
87 |
88 | ```python
89 | # The forward pass
90 | output_tensor = layer(input_tensor)
91 |
92 | print("--- PyTorch Calculation ---")
93 | print("Input (x):", input_tensor)
94 | print("Weight (W):\n", layer.weight)
95 | print("Bias (b):", layer.bias)
96 | print("\nOutput (y):", output_tensor)
97 | ```
98 |
99 | This will print:
100 |
101 | ```text
102 | --- PyTorch Calculation ---
103 | Input (x): tensor([[1., 2., 3.]])
104 | Weight (W):
105 | tensor([[0.1000, 0.2000, 0.3000],
106 |         [0.4000, 0.5000, 0.6000]], requires_grad=True)
107 | Bias (b): tensor([0.7000, 0.8000], requires_grad=True)
108 | 
109 | Output (y): tensor([[2.1000, 4.7000]], grad_fn=<AddmmBackward0>)
110 | ```
111 | The final output is the tensor `[[2.1, 4.7]]`.
112 |
113 | **4. Manual Verification: Step-by-Step**
114 |
115 | Let's prove this result. The calculation is `x @ W.T + b`.
116 |
117 | * **First, the matrix multiplication `x @ W.T`:**
118 | * `[1, 2, 3] @ [[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]`
119 | * `output[0] = (1*0.1) + (2*0.2) + (3*0.3) = 0.1 + 0.4 + 0.9 = 1.4`
120 | * `output[1] = (1*0.4) + (2*0.5) + (3*0.6) = 0.4 + 1.0 + 1.8 = 3.2`
121 | * Result: `[1.4, 3.2]`
122 |
123 | * **Second, add the bias `+ b`:**
124 | * `[1.4, 3.2] + [0.7, 0.8]`
125 | * Result: `[2.1, 4.7]`
126 |
127 | The manual calculation matches the PyTorch output exactly. This is all a linear layer does.
128 |
129 | #### The Scaling Problem
130 |
131 | This seems trivial. So where is the problem? The problem is scale.
132 |
133 | * **Our Toy Layer (`3x2`):**
134 | * Weight parameters: `3 * 2 = 6`
135 | * Bias parameters: `2`
136 | * **Total:** `8` trainable parameters.
137 |
138 | * **A Single LLM Layer (e.g., `4096x4096`):**
139 | * Weight parameters: `4096 * 4096 = 16,777,216`
140 | * Bias parameters: `4096`
141 | * **Total:** `16,781,312` trainable parameters.
142 |
143 | A single layer in an LLM can have over **16 million** parameters. A full model has dozens of these layers. Trying to update all of them during fine-tuning is what melts GPUs. This is the bottleneck LoRA is designed to break.
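
You can confirm those counts directly in PyTorch (a quick sketch, not part of the original text):

```python
import torch.nn as nn

toy = nn.Linear(in_features=3, out_features=2)        # 3*2 weights + 2 biases = 8
big = nn.Linear(in_features=4096, out_features=4096)  # 4096*4096 weights + 4096 biases
print(sum(p.numel() for p in toy.parameters()))       # 8
print(sum(p.numel() for p in big.parameters()))       # 16781312
```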
144 |
145 | ## **Chapter 3: The LoRA Method - Math and Astonishing Savings**
146 |
147 | This is the core idea. Instead of changing the massive weight matrix $W$, we freeze it and learn a tiny "adjustment" matrix, $\Delta W$.
148 |
149 | The new, effective weight matrix, $W_{eff}$, is a simple sum:
150 |
151 | $W_{eff} = W_{frozen} + \Delta W$
152 |
153 | Training the full $\Delta W$ would be too expensive. The breakthrough of LoRA is to force this change to be **low-rank**, meaning we can construct it from two much smaller matrices, $A$ and $B$. We also add a scaling factor, $\frac{\alpha}{r}$, where $r$ is the rank and $\alpha$ is a hyperparameter.
154 |
155 | The full LoRA update is defined by this formula:
156 |
157 | $\Delta W = \frac{\alpha}{r} B A$
158 |
159 | #### A Step-by-Step Numerical Example
160 |
161 | Let's build a tiny LoRA update from scratch.
162 |
163 | **Given:**
164 | * A frozen weight matrix $W_{frozen}$ of shape `[out=4, in=3]`.
165 | * A LoRA rank $r=2$.
166 | * A scaling factor $\alpha=4$.
167 |
168 | $W_{frozen} = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{pmatrix}$
169 |
170 | Now, we define our trainable LoRA matrices, $A$ and $B$:
171 | * $A$ must have shape `[r, in]`, so `[2, 3]`.
172 | * $B$ must have shape `[out, r]`, so `[4, 2]`.
173 |
174 | Let's assume after training they have these values:
175 |
176 | $A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix} \quad B = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix}$
177 |
178 | **Step 1: Calculate the core update, $B A$**
179 |
180 | This is a standard matrix multiplication. The result will have the same shape as $W_{frozen}$.
181 |
182 | $B A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix} = \begin{pmatrix} (1*1+0*0) & (1*0+0*3) & (1*2+0*0) \\ (0*1+0*0) & (0*0+0*3) & (0*2+0*0) \\ (0*1+2*0) & (0*0+2*3) & (0*2+2*0) \\ (1*1+1*0) & (1*0+1*3) & (1*2+1*0) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & 0 \\ 0 & 6 & 0 \\ 1 & 3 & 2 \end{pmatrix}$
183 |
184 | **Step 2: Apply the scaling factor, $\frac{\alpha}{r}$**
185 |
186 | Our scaling factor is $\frac{4}{2} = 2$. We multiply our result by this scalar.
187 |
188 | $\Delta W = 2 \times \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & 0 \\ 0 & 6 & 0 \\ 1 & 3 & 2 \end{pmatrix} = \begin{pmatrix} 2 & 0 & 4 \\ 0 & 0 & 0 \\ 0 & 12 & 0 \\ 2 & 6 & 4 \end{pmatrix}$
189 |
190 | This $\Delta W$ matrix is the total change that our LoRA parameters will apply to the frozen weights.
191 |
192 | **Step 3: The "Merge" for Inference**
193 |
194 | After training is done, we can create the final, effective weight matrix by adding the frozen weights and the LoRA update.
195 |
196 | $W_{eff} = W_{frozen} + \Delta W = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{pmatrix} + \begin{pmatrix} 2 & 0 & 4 \\ 0 & 0 & 0 \\ 0 & 12 & 0 \\ 2 & 6 & 4 \end{pmatrix} = \begin{pmatrix} 3 & 1 & 5 \\ 2 & 2 & 2 \\ 3 & 15 & 3 \\ 6 & 10 & 8 \end{pmatrix}$
197 |
198 | This final $W_{eff}$ matrix is what you would use for deployment. **Crucially, this merge calculation happens only once after training.** For inference, it's just a standard linear layer, adding zero extra latency.
199 |
200 | #### The Forward Pass (How it works during training)
201 |
202 | During training, we never compute the full $\Delta W$. That would be inefficient. Instead, we use the decomposed form, which is much faster. The forward pass is:
203 |
204 | $y = W_{frozen}x + \frac{\alpha}{r} B(Ax)$
205 |
206 | Let's compute this with an input $x = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$:
207 |
208 | 1. **LoRA Path (right side):**
209 | * `Ax =` $\begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{pmatrix} (1*1+0*2+2*3) \\ (0*1+3*2+0*3) \end{pmatrix} = \begin{pmatrix} 7 \\ 6 \end{pmatrix}$
210 | * `B(Ax) =` $\begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 7 \\ 6 \end{pmatrix} = \begin{pmatrix} (1*7+0*6) \\ (0*7+0*6) \\ (0*7+2*6) \\ (1*7+1*6) \end{pmatrix} = \begin{pmatrix} 7 \\ 0 \\ 12 \\ 13 \end{pmatrix}$
211 | * `Scale it: 2 *` $\begin{pmatrix} 7 \\ 0 \\ 12 \\ 13 \end{pmatrix} = \begin{pmatrix} 14 \\ 0 \\ 24 \\ 26 \end{pmatrix}$
212 |
213 | 2. **Frozen Path (left side):**
214 | * `W_frozen * x =` $\begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{pmatrix} (1+2+3) \\ (2+4+6) \\ (3+6+9) \\ (4+8+12) \end{pmatrix} = \begin{pmatrix} 6 \\ 12 \\ 18 \\ 24 \end{pmatrix}$
215 |
216 | 3. **Final Output:**
217 | * `y =` $\begin{pmatrix} 6 \\ 12 \\ 18 \\ 24 \end{pmatrix} + \begin{pmatrix} 14 \\ 0 \\ 24 \\ 26 \end{pmatrix} = \begin{pmatrix} 20 \\ 12 \\ 42 \\ 50 \end{pmatrix}$
218 |
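If you would like to confirm this arithmetic, here is a short PyTorch sketch (not part of the original walkthrough) that reproduces both the training-time path and the merged result:

```python
import torch

W_frozen = torch.tensor([[1., 1., 1.],
                         [2., 2., 2.],
                         [3., 3., 3.],
                         [4., 4., 4.]])
A = torch.tensor([[1., 0., 2.],
                  [0., 3., 0.]])     # shape [r=2, in=3]
B = torch.tensor([[1., 0.],
                  [0., 0.],
                  [0., 2.],
                  [1., 1.]])         # shape [out=4, r=2]
scaling = 4 / 2                      # alpha / r
x = torch.tensor([1., 2., 3.])

# Training-time path: never materialize the full delta_W
y = W_frozen @ x + scaling * (B @ (A @ x))
print(y)                             # tensor([20., 12., 42., 50.])

# One-time merge for inference: W_eff = W_frozen + (alpha/r) * B @ A
W_eff = W_frozen + scaling * (B @ A)
print(torch.allclose(W_eff @ x, y))  # True
```
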
219 | #### The Astonishing Savings
220 |
221 | This math is why LoRA works. Let's return to the realistic LLM layer (`4096x4096`) to see the impact.
222 |
223 | | Method | Trainable Parameters | Calculation | Parameter Reduction |
224 | | :--- | :--- | :--- | :--- |
225 | | **Full Fine-Tuning** | 16,777,216 | `4096 * 4096` | 0% |
226 | | **LoRA (r=8)** | **65,536** | `(8 * 4096) + (4096 * 8)` | **99.61%** |
227 |
228 | By performing the efficient forward pass during training, we only need to store and update the parameters for the tiny `A` and `B` matrices, achieving a >99% parameter reduction while still being able to modify the behavior of the massive base layer.
229 |
230 | ## **Chapter 5: The Main Event - Implementing LoRA in PyTorch**
231 |
232 | We will now translate the math from the previous chapter into a reusable PyTorch `nn.Module`. Our goal is to create a `LoRALinear` layer that wraps a standard `nn.Linear` layer, freezes it, and adds the trainable `A` and `B` matrices.
233 |
234 | #### The `LoRALinear` Module
235 |
236 | Here is the complete implementation, followed by a breakdown of each part.
237 |
238 | ```python
239 | import torch
240 | import torch.nn as nn
241 | import torch.nn.functional as F
242 | import math
243 |
244 | class LoRALinear(nn.Module):
245 | def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
246 | super().__init__()
247 | # --- Store hyperparameters ---
248 | self.r = r
249 | self.alpha = alpha
250 | self.scaling = self.alpha / self.r
251 |
252 | # --- Store and freeze the original linear layer ---
253 | self.base = base
254 | self.base.weight.requires_grad_(False)
255 | # Also freeze the bias if it exists
256 | if self.base.bias is not None:
257 | self.base.bias.requires_grad_(False)
258 |
259 | # --- Create the trainable LoRA matrices A and B ---
260 | # A has shape [r, in_features]
261 | # B has shape [out_features, r]
262 | self.lora_A = nn.Parameter(torch.empty(r, self.base.in_features))
263 | self.lora_B = nn.Parameter(torch.empty(self.base.out_features, r))
264 |
265 | # --- Initialize the weights ---
266 | # A is initialized with a standard method
267 | nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
268 | # B is initialized with zeros
269 | nn.init.zeros_(self.lora_B)
270 |
271 | def forward(self, x: torch.Tensor) -> torch.Tensor:
272 | # 1. The original, frozen path
273 | base_output = self.base(x)
274 |
275 | # 2. The efficient LoRA path: B(A(x))
276 | # F.linear(x, self.lora_A) computes x @ A.T
277 | # F.linear(..., self.lora_B) computes (x @ A.T) @ B.T
278 | lora_update = F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling
279 |
280 | # 3. Return the combined output
281 | return base_output + lora_update
282 | ```
283 |
284 | **Breakdown:**
285 |
286 | 1. **`__init__(self, base, r, alpha)`**:
287 | * It accepts the original `nn.Linear` layer (`base`) that we want to adapt.
288 | * `self.base.weight.requires_grad_(False)`: This is the critical **"freezing"** step. We tell PyTorch's autograd engine not to compute gradients for the original weights, so they will never be updated by the optimizer.
289 | * `nn.Parameter(...)`: We register `lora_A` and `lora_B` as official trainable parameters of the module. Their shapes are derived directly from the base layer and the rank `r`.
290 | * `nn.init.zeros_(self.lora_B)`: This is a crucial initialization detail. By starting `B` as a zero matrix, the entire LoRA update (`B @ A`) is zero at the beginning of training. This means our `LoRALinear` layer initially behaves exactly like the original frozen layer, and the model learns the "change" from a stable starting point.
291 |
292 | 2. **`forward(self, x)`**:
293 | * This is a direct translation of the formula: $y = W_{frozen}x + \frac{\alpha}{r} B(Ax)$
294 | * We compute the output of the frozen path and the LoRA path separately.
295 | * The nested `F.linear` calls are a highly efficient PyTorch way to compute `(x @ A.T) @ B.T` without ever forming the full $\Delta W$ matrix.
296 | * Finally, we add them together.
297 |
298 | #### Applying LoRA to a Model
299 |
300 | Now we need a helper function to swap out the `nn.Linear` layers in any given model with our new `LoRALinear` layer.
301 |
302 | ```python
303 | def apply_lora(model: nn.Module, r: int, alpha: float = 16.0):
304 | """
305 | Replaces all nn.Linear layers in a model with LoRALinear layers.
306 | """
307 | for name, module in list(model.named_modules()):
308 | if isinstance(module, nn.Linear):
309 |             # Find the parent module (the name may contain no '.' for top-level children)
310 |             parent_name, _, child_name = name.rpartition('.')
311 |             parent_module = model.get_submodule(parent_name) if parent_name else model
312 |
313 | # Replace the original linear layer
314 | setattr(parent_module, child_name, LoRALinear(module, r=r, alpha=alpha))
315 | ```
316 |
317 | #### Minimal End-to-End Demo
318 |
319 | Let's see it all work together.
320 |
321 | **1. Create a toy model:**
322 | ```python
323 | model = nn.Sequential(
324 | nn.Linear(128, 256),
325 | nn.ReLU(),
326 | nn.Linear(256, 10) # e.g., for classification
327 | )
328 | ```
329 |
330 | **2. Inject LoRA layers:**
331 | ```python
332 | apply_lora(model, r=8, alpha=16.0)
333 | print(model)
334 | ```
335 | The output will show that our `nn.Linear` layers have been replaced by `LoRALinear`.
336 |
337 | **3. Isolate the Trainable Parameters:**
338 | This is the most important step. We create an optimizer that *only* sees the LoRA weights.
339 |
340 | ```python
341 | # Filter for parameters that require gradients (only lora_A and lora_B)
342 | trainable_params = [p for p in model.parameters() if p.requires_grad]
343 | trainable_param_names = [name for name, p in model.named_parameters() if p.requires_grad]
344 |
345 | print("\nTrainable Parameters:")
346 | for name in trainable_param_names:
347 | print(name)
348 |
349 | # Create an optimizer that only updates the LoRA weights
350 | optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
351 | ```
352 |
353 | **Output:**
354 | ```text
355 | Trainable Parameters:
356 | 0.lora_A
357 | 0.lora_B
358 | 2.lora_A
359 | 2.lora_B
360 | ```
361 | This proves our success. The optimizer is completely unaware of the massive, frozen weights (`0.base.weight`, `2.base.weight`, etc.) and will only update our tiny, efficient LoRA matrices.
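
As an extra sanity check (not in the original demo, but continuing directly from the `model` defined above), you can count how many parameters the optimizer will actually touch:

```python
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} of {total:,} parameters")
# Trainable: 5,200 of 40,794 parameters for this toy model
```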
362 |
363 | ## **Chapter 6: Conclusion - From a Toy Model to a Real Transformer**
364 |
365 | Let's recap the journey. We started with a simple `nn.Linear` layer and saw how its parameter count explodes at the scale of a real LLM. We then introduced the core mathematical trick of LoRA: approximating the massive update matrix `ΔW` with two small, low-rank matrices `A` and `B`. This simple idea led to a staggering >99% reduction in trainable parameters. Finally, we translated that math into a clean, reusable `LoRALinear` PyTorch module and proved that an optimizer could be set up to *only* train these new, tiny matrices.
366 |
367 | #### Where does LoRA actually go in an LLM?
368 |
369 | The `nn.Linear` layers we've been working with are not just abstract examples. They are the primary components of a Transformer, the architecture behind virtually all modern LLMs.
370 |
371 | When you apply LoRA to a model like Llama or Mistral, you are targeting these specific linear layers:
372 |
373 | * **Self-Attention Layers:** The most common targets are the projection matrices for the **query (`q_proj`)** and **value (`v_proj`)**. Adapting these allows the model to change *what it pays attention to* in the input text, which is incredibly powerful for task-specific fine-tuning.
374 | * **Feed-Forward Layers (MLP):** Transformers also have blocks of linear layers that process information after the attention step. Applying LoRA here helps modify the model's learned representations and knowledge.
375 |
376 | So, when you see a LoRA implementation for a real LLM, the `apply_lora` function is simply more selective, replacing only the linear layers named `q_proj`, `v_proj`, etc., with the `LoRALinear` module you just built.
377 |
378 | #### Why This Works So Well
379 |
380 | The stunning effectiveness of LoRA relies on a powerful hypothesis: the knowledge needed to adapt a pre-trained model to a new task is much simpler than the model's entire knowledge base. You don't need to re-learn the entire English language to make a model a better chatbot. You only need to steer its existing knowledge. This "steering" information lies in a low-dimensional space, which a low-rank update `ΔW = B @ A` can capture perfectly.
381 |
382 | You now have a deep, practical understanding of one of the most important techniques in modern AI. You know the "what," the "why," and the "how" behind LoRA, giving you the foundation to efficiently adapt massive language models.
--------------------------------------------------------------------------------
/docs/llm/adam.md:
--------------------------------------------------------------------------------
1 | # **Title: Give me 30 min, I will make the Adam Optimizer Click. Forever.**
2 |
3 | ## Intro
4 |
5 | You know Gradient Descent. You know the formula: `new_weight = old_weight - learning_rate * gradient`. It's simple. It works.
6 |
7 | But it's not what powers modern AI.
8 |
9 | In every state-of-the-art model, in every high-performance training script, you see the same name: **Adam**. You're told to just use it. It's the default. It's "better."
10 |
11 | But why?
12 |
13 | You look up the algorithm and are hit with a wall of math. A black box of Greek letters and strange terms that feel impossibly complex.
14 |
15 | * Exponentially Weighted Moving Average (EWMA)
16 | * First and Second Moments
17 | * Bias Correction
18 |
19 | It seems like a magic spell you're supposed to cast without understanding.
20 |
21 | Here's the secret: **Adam isn't one complex idea. It's three simple ideas, bolted together to solve three specific problems.**
22 |
23 | Give me 30 minutes. We will tear Adam down to its fundamental parts and rebuild it from the ground up. No skipped steps. No magic. By the end, you will have a deep, intuitive, and permanent understanding of every single component.
24 |
25 | This is the algorithm you are about to master:
26 |
27 | 1. **First Moment (Momentum):** $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
28 | 2. **Second Moment (Adaptive LR):** $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
29 | 3. **Bias Correction:** $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
30 | 4. **The Update:** $\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
31 |
32 | That wall of math will look simple. You will see *why* each piece exists and the exact problem it solves.
33 |
34 | Ready to make it click? Let's begin.
35 |
36 | ## **Part 1: The First Problem - Inefficient Progress**
37 |
38 | **The Core Problem:** Gradient Descent is memoryless. As it gets closer to the minimum and the gradient shrinks, its steps become smaller and smaller, causing it to slow down dramatically.
39 |
40 | Let's use the simple function `f(p) = p**2`. The minimum is at `p=0`, and the gradient is `f'(p) = 2p`. We start at `p=10` and use a learning rate of `η = 0.1`.
41 |
42 | #### **Algorithm 1: Standard Gradient Descent**
43 |
44 | The update rule is simple and direct.
45 | ```
46 | FOR each iteration:
47 | gradient = 2 * params
48 | params = params - 0.1 * gradient
49 | ```
50 | This is our baseline—the slow, steady crawl.
51 |
52 | | Iteration | Current `p` | Gradient `g = 2p` | Update `0.1 * g` | New `p` |
53 | | :-------- | :---------- | :---------------- | :--------------- | :------ |
54 | | 0 | 10.000 | 20.000 | 2.000 | 8.000 |
55 | | 1 | 8.000 | 16.000 | 1.600 | 6.400 |
56 | | 2 | 6.400 | 12.800 | 1.280 | 5.120 |
57 | | 3 | 5.120 | 10.240 | 1.024 | 4.096 |
58 | | 4 | 4.096 | 8.192 | 0.819 | 3.277 |
59 |
60 | **Analysis of the Slowness:** The "Update" size is constantly shrinking: `2.0` → `1.6` → `1.28`... This is **deceleration**. The algorithm becomes less effective with every step.
61 |
62 | #### **Algorithm 2: Gradient Descent with Momentum**
63 |
64 | Now, let's add a `velocity` term with a more moderate `beta` of `0.5`. This will allow inertia to build up without running out of control.
65 |
66 | ```
67 | velocity = 0
68 | FOR each iteration:
69 | gradient = 2 * params
70 | velocity = 0.5 * velocity + gradient
71 | params = params - 0.1 * velocity
72 | ```
73 | Watch the difference in convergence.
74 |
75 | | Iteration | Current `p` | Gradient `g` | Velocity `v = 0.5*v + g` | New `p` |
76 | | :-------- | :---------- | :----------- | :--------------------------- | :------ |
77 | | 0 | 10.000 | 20.000 | `0.5*0 + 20.0 = 20.000` | 8.000 |
78 | | 1 | 8.000 | 16.000 | `0.5*20.0 + 16.0 = 26.000` | 5.400 |
79 | | 2 | 5.400 | 10.800 | `0.5*26.0 + 10.8 = 23.800` | 3.020 |
80 | | 3 | 3.020 | 6.040 | `0.5*23.8 + 6.04 = 17.940` | 1.226 |
81 | | 4 | 1.226 | 2.452 | `0.5*17.94 + 2.45 = 11.422` | 0.084 |
82 |
83 | **Analysis of the Success:**
84 | * **Compare `p` at Iteration 4:** Standard Gradient Descent is still far away at `3.277`. Momentum is already at `0.084`, practically at the minimum. This is a clear, unambiguous win.
85 | * **Look at the `velocity`:** In step 1, the gradient was `16`, but the velocity was `26`. In step 2, the gradient was `10.8`, but the velocity was `23.8`. Because the gradients were all in the same direction, they accumulated, creating a much larger and more effective update step. This is **controlled acceleration**.
86 | * **No Instability:** Unlike the previous bad example, this version converges beautifully without any wild overshooting.
87 |
88 | ---
89 | #### **Revisiting the Ravine: The Two Jobs of Momentum**
90 |
91 | Now we can confidently state that Momentum is a superior algorithm. In a complex landscape like our 2D ravine (`f(p) = p[0]**2 + 50 * p[1]**2`), it performs two critical jobs simultaneously:
92 |
93 | 1. **Accelerates:** In the shallow `p[0]` direction, the gradients are small but consistent. Momentum builds up velocity here—just like in our successful 1D example—speeding up progress along the valley floor.
94 | 2. **Damps:** In the steep `p[1]` direction, the gradients are huge but constantly flip signs (`+150`, `-120`, etc.). When Momentum averages these opposing forces, they cancel each other out, which powerfully suppresses the wasteful zig-zagging.
95 |
96 | Momentum intelligently uses its memory of past gradients to navigate more efficiently.
97 |
98 | **Problem Solved:** We have a mechanism to fix Gradient Descent's inefficient, memoryless updates.
99 |
100 | **But a new problem emerges:** While smarter, this approach still applies the same learning rate to every parameter. Isn't there a way to give each parameter its *own* adaptive learning rate from the start?
101 |
102 | ## **Part 2: The Second Problem - Inflexible Learning Rates**
103 |
104 | **The Core Problem:** Momentum helps find a better direction, but it's still handicapped by a single, global learning rate. This fails when parameters have vastly different sensitivities.
105 |
106 | Let's design a function where this failure is guaranteed:
107 | `f(p) = 50 * p[0]**2 + p[1]**2`
108 |
109 | The minimum is at `(0, 0)`. The gradient vector is:
110 | * `∂f/∂p[0] = 100 * p[0]`
111 | * `∂f/∂p[1] = 2 * p[1]`
112 |
113 | The gradient for `p[0]` is **50 times stronger** than for `p[1]`. This means `p[0]` is an extremely "sensitive" parameter, while `p[1]` is "stubborn."
114 |
115 | #### **Algorithm 1: Naive Gradient Descent**
116 |
117 | To prevent the update for the sensitive `p[0]` from exploding, we are forced to choose a tiny learning rate. Let's use `η = 0.01`. We will start at `p = (1.5, 10.0)`.
118 |
119 | ```
120 | FOR each iteration:
121 | gradient = [100*p[0], 2*p[1]]
122 | params = params - 0.01 * gradient
123 | ```
124 | Watch how slowly `p[1]` converges.
125 |
126 | | Iteration | Current `p` | Gradient `g` | Update `0.01 * g` | New `p` |
127 | | :-------- | :---------- | :------------- | :------------------ | :---------- |
128 | | 0 | `(1.500, 10.000)` | `[150.0, 20.0]`| `[1.500, 0.200]` | `(0.000, 9.800)` |
129 | | 1 | `(0.000, 9.800)` | `[0.0, 19.6]` | `[0.000, 0.196]` | `(0.000, 9.604)` |
130 | | 2 | `(0.000, 9.604)` | `[0.0, 19.208]`| `[0.000, 0.192]` | `(0.000, 9.412)` |
131 | | 3 | `(0.000, 9.412)` | `[0.0, 18.824]`| `[0.000, 0.188]` | `(0.000, 9.224)` |
132 |
133 | **Analysis of the Failure:** The learning rate `η=0.01` was just right for `p[0]`, which converged in one step. But this same learning rate is cripplingly small for `p[1]`. Its progress is a slow crawl: `10.0` → `9.8` → `9.6` → `9.4`. It will take hundreds of steps to reach zero. This is the definition of inefficiency.
134 |
135 | #### **Algorithm 2: AdaGrad (Adaptive Gradient)**
136 |
137 | AdaGrad gives each parameter its own learning rate that adapts over time. It does this by dividing the base learning rate by the square root of the sum of all past squared gradients for that parameter.
138 |
139 | **THE ALGORITHM: AdaGrad**
140 | ```
141 | g_squared = [0, 0]
142 | FOR each iteration:
143 | gradient = calculate_gradient(params)
144 | g_squared += gradient**2
145 | adapted_lr = learning_rate / (sqrt(g_squared) + epsilon)
146 | params = params - adapted_lr * gradient
147 | ```
148 | Because it's adaptive, we can use a much more aggressive base learning rate. Let's use `η = 1.5`.
149 |
150 | | Iteration | Current `p` | Gradient `g` | `g_squared` (Accumulator) | Effective LR `η/sqrt(g_sq)` | Update | New `p` |
151 | | :-------- | :---------- | :------------- | :------------------------ | :------------------------- | :------- | :---------- |
152 | | 0 | `(1.5, 10.0)` | `[150, 20]` | `[22500, 400]` | `[0.01, 0.075]` | `[1.5, 1.5]` | `(0.0, 8.5)` |
153 | | 1 | `(0.0, 8.5)` | `[0, 17]` | `[22500, 689]` | `[0.01, 0.057]` | `[0, 0.97]`| `(0.0, 7.53)` |
154 | | 2 | `(0.0, 7.53)` | `[0, 15.06]` | `[22500, 916]` | `[0.01, 0.050]` | `[0, 0.75]`| `(0.0, 6.78)` |
155 | | 3 | `(0.0, 6.78)` | `[0, 13.56]` | `[22500, 1100]` | `[0.01, 0.045]` | `[0, 0.61]`| `(0.0, 6.17)` |
156 |
157 | **Analysis of the Success:**
158 | * **Look at `p[0]`:** In step 0, the accumulated `g_squared[0]` was `22500`. Its square root is `150`. The effective learning rate for `p[0]` became `1.5 / 150 = 0.01`. AdaGrad *automatically* discovered the perfect small learning rate for the sensitive parameter.
159 | * **Look at `p[1]`:** In step 0, `g_squared[1]` was only `400`. Its square root is `20`. The effective learning rate for `p[1]` was `1.5 / 20 = 0.075`. This is much larger than the `0.01` used by naive GD.
160 | * **The Final Comparison:** After 4 steps, naive Gradient Descent got `p[1]` to `9.224`. AdaGrad got it to `6.17`. AdaGrad is converging dramatically faster because it assigned a more appropriate learning rate to the stubborn parameter.
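
For reference, the AdaGrad loop above fits in a few lines of Python. This is a toy sketch under the same assumptions as the table (base learning rate `1.5`, start at `(1.5, 10.0)`), not a production implementation.

```python
import numpy as np

# AdaGrad on f(p) = 50*p[0]**2 + p[1]**2
p = np.array([1.5, 10.0])
g_squared = np.zeros(2)        # per-parameter sum of squared gradients
lr, eps = 1.5, 1e-8

for step in range(4):
    grad = np.array([100 * p[0], 2 * p[1]])
    g_squared += grad ** 2
    adapted_lr = lr / (np.sqrt(g_squared) + eps)   # per-parameter learning rate
    p = p - adapted_lr * grad
    print(step, p.round(2))    # (0.0, 8.5), (0.0, 7.53), (0.0, 6.78), (0.0, 6.17), as in the table
```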
161 |
162 | **Problem Solved:** We have introduced adaptive, per-parameter learning rates, making optimization robust to wildly different gradient scales.
163 |
164 | **But a new problem emerges:** Look at AdaGrad's `g_squared` accumulator. The values `[22500, 400]` grew to `[22500, 1100]`. This sum *only ever increases*. Over a long training run, this denominator will grow so large that the effective learning rate for all parameters will shrink to effectively zero, stopping learning prematurely. This is known as a "decaying learning rate" problem, and it's what Adam must fix next.
165 |
166 | ## **Part 3: Deconstructing Adam - The Theory**
167 |
168 | Our goal is to create an optimizer that combines the directional intelligence of Momentum with the adaptive learning rates of AdaGrad, while fixing AdaGrad's "dying learning rate" problem. Adam achieves this by using a more flexible memory system for both direction and magnitude.
169 |
170 | #### **The Complete Adam Algorithm**
171 |
172 | Here is the full blueprint of the algorithm we are about to deconstruct.
173 |
174 | 1. **Initialize:**
175 | * `m = 0` (First moment vector)
176 | * `v = 0` (Second moment vector)
177 | * `t = 0` (Timestep)
178 |
179 | 2. **Loop for each training iteration:**
180 | * `t = t + 1`
181 | * `g_t =` Calculate gradient at current step
182 | * **Update Biased Moment Estimates:**
183 | * $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
184 | * $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
185 | * **Compute Bias-Corrected Estimates:**
186 | * $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
187 | * $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
188 | * **Update Parameters:**
189 | * $\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
190 |
191 | Now, let's break down what each part of this machine does.
192 |
193 | #### **The Core Component: Exponentially Weighted Moving Average (EWMA)**
194 |
195 | Adam is built entirely on the concept of the EWMA, which is a "forgetful" average. Its formula is:
196 |
197 | `average_t = β * average_{t-1} + (1 - β) * new_value_t`
198 |
199 | * `β` (beta) is the "decay rate" or memory factor, a number between 0 and 1. It controls how much of the old average to keep.
200 | * A high `β` (like 0.99) means the average has a long memory and changes slowly.
201 | * A low `β` (like 0.1) means the average has a short memory and reacts quickly to new values.
202 |
203 | This "forgetful" property is what fixes AdaGrad's problem of its learning rate only ever shrinking.
204 |
205 | #### **Line-by-Line Breakdown of the Algorithm**
206 |
207 | **Line 1: `m_t = β₁ * m_{t-1} + (1 - β₁) * g_t`**
208 | * **What it is:** The **First Moment Estimate**.
209 | * **Purpose:** This is the **Direction Engine**. It calculates the EWMA of the gradients (`g_t`). It acts like a more robust version of Momentum's velocity, tracking the average direction of descent.
210 | * **`β₁` (beta1):** The memory factor for the direction. It is typically set to `0.9`.
211 |
212 | **Line 2: `v_t = β₂ * v_{t-1} + (1 - β₂) * g_t²`**
213 | * **What it is:** The **Second Moment Estimate**.
214 | * **Purpose:** This is the **Adaptive Learning Rate Engine**. It calculates the EWMA of the *squared* gradients (`g_t²`). This tracks the average magnitude of the gradients, replacing AdaGrad's ever-growing sum with a "forgetful" average.
215 | * **`β₂` (beta2):** The memory factor for the magnitude. It is typically set to `0.999`, giving it a much longer memory than `m` to ensure the learning rate stays stable.
216 |
217 | **Line 3 & 4: The Bias Correction (`m_hat`, `v_hat`)**
218 | * **The Problem:** `m` and `v` are initialized to zero. At the beginning of training, their values are artificially small because they are biased toward this zero starting point. This would cause the optimizer to take tiny, inefficient steps initially.
219 | * **The Solution:** These lines correct for that initial bias.
220 | * `β₁^t` means the constant `β₁` raised to the power of the current timestep `t`.
221 | * At `t=1`, this correction is large, boosting the estimates to be more accurate.
222 | * As `t` increases, the correction term `(1 - β^t)` approaches 1, and the correction fades away, which is exactly what we need.
223 |
224 | **Line 5: The Final Update (`θ_t = ...`)**
225 | * **What it is:** The actual parameter update step.
226 | * **Purpose:** This line combines all the components.
227 | * It determines the step **direction** using `m_hat`.
228 | * It scales the step size for each parameter using `sqrt(v_hat)`. This division is what gives Adam its adaptive, per-parameter learning rate.
229 | * `η` is the master learning rate you provide, and `ε` is a tiny value to prevent division by zero.
230 |
231 | In essence, Adam runs two intelligent, "forgetful" averages—one for direction and one for magnitude—corrects their initial bias, and then uses them to perform a robust and adaptive update step.
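
As a sanity check, here is a compact, runnable Python sketch of the blueprint above applied to the toy function `f(p) = p**2` (gradient `2p`). The structure and names mirror the algorithm; the aggressive learning rate of `1.05` is the one we dissect in Part 4.

```python
import math

def adam_demo(p=10.0, lr=1.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=10):
    m, v = 0.0, 0.0                              # first and second moment estimates
    for t in range(1, steps + 1):
        g = 2 * p                                # gradient at the current point
        m = beta1 * m + (1 - beta1) * g          # direction engine (EWMA of g)
        v = beta2 * v + (1 - beta2) * g ** 2     # magnitude engine (EWMA of g^2)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        print(t, round(p, 2))
    return p

adam_demo()   # p goes 10 -> 8.95 -> 7.90 -> ..., matching the hand trace in Part 4 up to rounding
```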
232 |
233 | ## **Part 4: Adam in Action - A Definitive Example**
234 |
235 | **The Scenario**
236 | We will create a situation designed to make naive Gradient Descent fail catastrophically, so we can see how Adam's internal mechanisms save it.
237 |
238 | * **Function:** `f(p) = p**2` (Minimum at `p=0`, Gradient `g = 2p`)
239 | * **Starting Point:** `p = 10`
240 | * **Learning Rate (`η`):** We will use an **explosive** learning rate of `η = 1.05`. For this function, any learning rate greater than `1.0` will cause naive Gradient Descent to diverge.
241 |
242 | #### **Baseline: Naive Gradient Descent (Complete Failure)**
243 |
244 | The update rule is simple: `p_new = p_old - η * g`.
245 |
246 | | Iteration | Current `p` | Gradient `g = 2p` | Update `1.05 * g` | New `p` |
247 | | :-------- | :---------- | :---------------- | :---------------- | :------- |
248 | | 0 | 10.00 | 20.00 | 21.00 | -11.00 |
249 | | 1 | -11.00 | -22.00 | -23.10 | 12.10 |
250 | | 2 | 12.10 | 24.20 | 25.41 | -13.31 |
251 | | 3 | -13.31 | -26.62 | -27.95 | 14.64 |
252 | | 4 | 14.64 | 29.28 | 30.74 | -16.10 |
253 |
254 | **Analysis of the Failure:** Look at the absolute value of `p`. It is growing at every step: `10` → `11` → `12.1` → `13.31`. This is **divergence**. The algorithm is not just inefficient; it is fundamentally broken and exploding towards infinity. It is unusable with this learning rate.
255 |
256 | ---
257 |
258 | #### **Adam: Taming the Explosive Learning Rate**
259 |
260 | Now, we give Adam the **exact same unusable learning rate** (`η = 1.05`) and watch how its machinery handles the situation. We will use standard `β₁=0.9` and `β₂=0.999`. Let's trace the first 10 iterations to see the full story.
261 |
262 | | t | `p` | `g` | `m` | `v` | `m_hat` | `v_hat` | Update | New `p` |
263 | |:-:|:----|:----|:----|:----|:----|:----|:--- |:--- |
264 | | 1 | 10.00 | 20.00 | 2.00 | 0.40 | 20.00 | 400.0 | 1.05 | 8.95 |
265 | | 2 | 8.95 | 17.90 | 3.59 | 0.72 | 18.89 | 360.4 | 1.04 | 7.91 |
266 | | 3 | 7.91 | 15.82 | 4.81 | 0.97 | 17.58 | 354.7 | 0.98 | 6.93 |
267 | | 4 | 6.93 | 13.86 | 5.72 | 1.16 | 16.58 | 341.7 | 0.94 | 5.99 |
268 | | 5 | 5.99 | 11.98 | 6.35 | 1.31 | 15.51 | 319.8 | 0.91 | 5.08 |
269 | | 6 | 5.08 | 10.16 | 6.73 | 1.41 | 14.41 | 294.1 | 0.88 | 4.20 |
270 | | 7 | 4.20 | 8.40 | 6.90 | 1.48 | 13.35 | 266.3 | 0.85 | 3.35 |
271 | | 8 | 3.35 | 6.70 | 6.88 | 1.51 | 12.35 | 237.9 | 0.82 | 2.53 |
272 | | 9 | 2.53 | 5.06 | 6.70 | 1.51 | 11.41 | 209.9 | 0.80 | 1.73 |
273 | | 10| 1.73 | 3.46 | 6.38 | 1.48 | 10.53 | 182.9 | 0.78 | 0.95 |
274 |
275 | **Analysis of the Definitive Success:**
276 |
277 | 1. **Adam is Stable and Converging:** The primary result is undeniable. Where naive GD exploded, Adam's `p` value steadily and rapidly decreases: `10 → 8.95 → ... → 0.95`. It successfully tamed an otherwise unusable learning rate and is converging beautifully.
278 |
279 | 2. **The Adaptive Rate is the Hero:** How did it survive? Look at the `v_hat` column. It starts high (`400`) because the initial gradients are large, and then it slowly decays as `p` gets smaller. The crucial term is the denominator of the final update, `sqrt(v_hat)`. At step 1, this was `sqrt(400) = 20`. This means Adam calculated an **"effective learning rate"** of `η / 20 = 1.05 / 20 ≈ 0.053`. It automatically throttled the explosive `1.05` down to a safe and effective `0.053`. This dynamic self-correction is Adam's superpower.
280 |
281 | 3. **Momentum Provides the Smoothness:** Look at the `m_hat` column. It provides a smooth, consistent estimate of the direction, preventing the wild oscillations we saw in the failed GD example.
282 |
283 | This walkthrough proves Adam's value. It is not just another optimizer; it is a robust, self-correcting system. It takes a potentially dangerous hyperparameter (the learning rate) and adapts it on the fly, protecting the training process from instability and divergence. This robustness is precisely why Adam is the default, go-to optimizer for nearly all modern deep learning applications.
284 |
285 | ## **Conclusion: From Simple Steps to Intelligent Adaptation**
286 |
287 | You have just mastered the core logic behind modern optimization. We began with the simple idea of Gradient Descent and systematically solved its flaws, piece by piece, culminating in Adam.
288 |
289 | The journey was a logical progression:
290 |
291 | | Problem | Solution | Key Idea |
292 | | :-------------------------------------------- | :---------------------- | :--------------------------------------------------------------------- |
293 | | 1. **Inefficient Direction & Oscillation** | **Momentum** | Average past gradients to find a better, smoother direction. |
294 | | 2. **Inflexible, One-Size-Fits-All LR** | **AdaGrad** | Give each parameter its own learning rate based on its gradient history. |
295 | | 3. **AdaGrad's LR Dies Prematurely** | **"Forgetful" Averages (EWMA)** | Replace the permanent sum with a moving average that forgets the past. |
296 | | 4. **Initial Steps are Too Small** | **Bias Correction** | Correct the initial bias of the moving averages for a faster start. |
297 |
298 | **Adam is not a single complex idea; it is the synthesis of these four solutions.** It uses an EWMA of gradients for **direction** (Momentum) and an EWMA of squared gradients for a per-parameter **adaptive rate** (AdaGrad), fixing both with **bias correction**.
299 |
300 | The result is a robust, high-performance algorithm that automatically adapts to the unique challenges of a complex loss landscape. You now understand not just *what* Adam does, but *why* every single component of its machinery exists. It's not magic—it's brilliant engineering.
--------------------------------------------------------------------------------
/docs/ml/linear.md:
--------------------------------------------------------------------------------
1 | # **Title: Give me 20 min, I will make Linear Regression Click Forever**
2 |
3 | > **Thumbnail: Linear Regression is Actually EASY**
4 |
5 | ## **Section 1: The Promise: From Data to Predictions in 20 Minutes**
6 |
7 | Give me 20 minutes, I will make Linear Regression click for you.
8 |
9 | Linear regression seems simple—it's just fitting a line to data. But then you encounter a wall of jargon: *loss functions*, *gradient descent*, *cost surface*, *learning rates*, and the *normal equation*. The goal is to get past the terminology and see the simple, powerful machine at work.
10 |
11 | We will start with a concrete problem: predicting a student's final exam score based on the hours they studied.
12 |
13 | Here is our data:
14 | | Hours Studied (x) | Exam Score (y) |
15 | | :--------------- | :------------- |
16 | | 1 | 2 |
17 | | 2 | 4 |
18 | | 3 | 5 |
19 | | 4 | 4 |
20 | | 5 | 5 |
21 |
22 | Our goal is to find a function that takes *hours studied* and outputs a predicted *exam score*.
23 |
24 | By the end of this tutorial, you will understand the fundamental components of not just linear regression, but many machine learning models. You will be able to explain, use, and even code the following from scratch.
25 |
26 | **Your Learning Promise:**
27 |
28 | You will master the three pillars of this model and a one-shot analytical solution.
29 |
30 | 1. **The Model (Hypothesis Function):** The formula that makes predictions. For a single input feature like 'hours studied', it's a simple line. For multiple features, it's a plane or hyperplane.
31 | * **Formula:** $\hat{y} = w \cdot x + b$
32 | * **Vector Form:** $\hat{y} = \mathbf{w}^T \mathbf{x} + b$
33 |
34 | 2. **The Loss Function (Cost Function):** A function that measures how bad our model's predictions are. Our goal is to make this number as small as possible. We will use the Mean Squared Error (MSE).
35 | * **Formula:** $J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$
36 |
37 | 3. **The Optimizer (Gradient Descent):** The algorithm that systematically finds the best values for `w` and `b` by minimizing the loss function. It's like walking down a hill to find the lowest point.
38 | * **Update Rule:**
39 | $w \leftarrow w - \alpha \frac{\partial J}{\partial w}$
40 | $b \leftarrow b - \alpha \frac{\partial J}{\partial b}$
41 |
42 | 4. **The Analytical Solution (Normal Equation):** A direct, one-shot formula to calculate the best `w` and `b` without any iteration. It is a powerful shortcut.
43 | * **Formula:** $\theta = (X^T X)^{-1} X^T y$
44 |
45 | These components are the engine of linear regression. Let's build it, piece by piece.
46 |
47 | ## **Section 2: The Model: What is a "Best Fit" Line?**
48 |
49 | **The Big Picture:** At its core, linear regression is about finding a line that best summarizes the relationship between our input (`x`) and output (`y`). Imagine our data as points on a graph. Our goal is to draw one straight line through those points that is as close to all of them as possible.
50 |
51 | **A Concrete Example:** Let's use our student data.
52 |
53 | | Hours Studied (x) | Exam Score (y) |
54 | | :--------------- | :------------- |
55 | | 1 | 2 |
56 | | 2 | 4 |
57 | | 3 | 5 |
58 | | 4 | 4 |
59 | | 5 | 5 |
60 |
61 | If we plot this, we get a scatter of points.
62 |
63 | ```
64 | A 2D scatter plot.
65 | X-axis is "Hours Studied", from 0 to 6.
66 | Y-axis is "Exam Score", from 0 to 6.
67 | Points are plotted at: (1,2), (2,4), (3,5), (4,4), (5,5).
68 | The points generally trend upwards and to the right.
69 | ```
70 |
71 | Now, let's try to draw two different lines through this data to see what "fitting" means.
72 |
73 | ```
74 | Same scatter plot as above.
75 | Add Line 1 (a bad fit): A red dashed line starting at (0,0) and going through (5,5). It passes far below some points and far above others.
76 | Add Line 2 (a good fit): A green solid line that doesn't necessarily hit any single point, but passes through the "middle" of the cloud of points, minimizing the average distance to all of them.
77 | ```
78 |
79 | Visually, the green line is a better fit. Our model is the equation for that line.
80 |
81 | **The Line Equation**
82 |
83 | The formula for any straight line is:
84 | $\hat{y} = w \cdot x + b$
85 |
86 | * $\hat{y}$ (y-hat): This is our **predicted** output value (e.g., predicted exam score).
87 | * $x$: This is our input value (e.g., hours studied).
88 | * $w$: The **weight** (or slope). It controls the steepness of the line. A bigger `w` means that for every hour studied, the predicted score increases more.
89 | * $b$: The **bias** (or y-intercept). It's the value of $\hat{y}$ when $x=0$. You can think of it as a baseline prediction.
90 |
91 | Finding the "best fit line" is just a search for the optimal values of `w` and `b`.
92 |
93 | **Expanding to More Inputs**
94 |
95 | What if we want to predict a house price based on its size (`x1`) and the number of bedrooms (`x2`)? The model scales easily. Instead of a line, we are now fitting a plane.
96 |
97 | $\hat{y} = w_1 \cdot x_1 + w_2 \cdot x_2 + b$
98 |
99 | The principle is identical: find the weights (`w1`, `w2`) and bias (`b`) that make the predictions closest to the actual house prices.
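
In code, the single-feature and multi-feature versions are literally the same expression, a dot product plus a bias. The weights below are made up purely for illustration.

```python
import numpy as np

# y_hat = w . x + b, with one feature...
w = np.array([0.5])
b = 1.0
x = np.array([3.0])                  # 3 hours studied
print(float(w @ x + b))              # 2.5

# ...and with two features (made-up house-price weights)
w2 = np.array([120.0, 5000.0])
b2 = 10000.0
x2 = np.array([85.0, 3.0])           # 85 square meters, 3 bedrooms
print(float(w2 @ x2 + b2))           # 35200.0
```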
100 |
101 |
102 | ## **Section 3: The Loss Function: Quantifying the "Error"**
103 |
104 | **The Big Picture:** Our eyes can tell us the green line is better than the red one. But to find the *best* line, a computer needs a precise, mathematical way to measure how "bad" any given line is. This measurement is called the **loss function**. A high loss means a bad fit. A low loss means a good fit.
105 |
106 | Our goal is to find the `w` and `b` that give the lowest possible loss.
107 |
108 | **Mean Squared Error (MSE)**
109 |
110 | The most common loss function for regression is the Mean Squared Error. The formula looks intimidating, but the idea is simple. We calculate it in three steps for every point in our data:
111 |
112 | 1. **Calculate the error:** For a single point, find the difference between the predicted value and the actual value. This vertical distance is called the *residual*.
113 | `error = predicted_y - actual_y` or $\hat{y}_i - y_i$
114 | 2. **Square the error:** We square the error to get rid of negative signs (so errors don't cancel each other out) and to penalize large errors much more than small ones. An error of 3 becomes 9, while an error of 10 becomes 100.
115 | `squared_error = (error)^2`
116 | 3. **Take the mean:** We calculate the squared error for all our data points and then take the average. This gives us a single number that represents the overall quality of our line.
117 |
118 | **Let's Calculate the Loss for Two Lines**
119 |
120 | Let's prove with math that our visual intuition was right.
121 |
122 | * **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]`
123 | * **Line 1 (Bad Guess):** $\hat{y} = 1 \cdot x + 1$ (Here, `w=1`, `b=1`)
124 | * **Line 2 (Better Guess):** $\hat{y} = 0.6 \cdot x + 2.5$ (Here, `w=0.6`, `b=2.5`)
125 |
126 | #### Calculation for Line 1 (w=1, b=1)
127 |
128 | | x | y (actual) | $\hat{y} = 1x+1$ (predicted) | Error ($\hat{y}-y$) | Squared Error |
129 | |:-:|:----------:|:----------------------------:|:-------------------:|:---------------:|
130 | | 1 | 2 | 2 | 0 | 0 |
131 | | 2 | 4 | 3 | -1 | 1 |
132 | | 3 | 5 | 4 | -1 | 1 |
133 | | 4 | 4 | 5 | 1 | 1 |
134 | | 5 | 5 | 6 | 1 | 1 |
135 | | | | | **Sum:** | **4** |
136 |
137 | **MSE for Line 1 = (Sum of Squared Errors) / n = 4 / 5 = 0.8**
138 |
139 | #### Calculation for Line 2 (w=0.6, b=2.5)
140 |
141 | | x | y (actual) | $\hat{y} = 0.6x+2.5$ (predicted) | Error ($\hat{y}-y$) | Squared Error |
142 | |:-:|:----------:|:--------------------------------:|:-------------------:|:---------------:|
143 | | 1 | 2 | 3.1 | 1.1 | 1.21 |
144 | | 2 | 4 | 3.7 | -0.3 | 0.09 |
145 | | 3 | 5 | 4.3 | -0.7 | 0.49 |
146 | | 4 | 4 | 4.9 | 0.9 | 0.81 |
147 | | 5 | 5 | 5.5 | 0.5 | 0.25 |
148 | | | | | **Sum:** | **2.85** |
149 |
150 | **MSE for Line 2 = (Sum of Squared Errors) / n = 2.85 / 5 = 0.57**
151 |
152 | **The Result:** Line 2 has a lower MSE (0.57) than Line 1 (0.8). The math confirms it is a better fit. The goal of training, which we will cover next, is to find the values for `w` and `b` that produce the minimum possible MSE.
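
The same comparison takes only a few lines of code. This is a small sketch using the data and the two candidate lines above.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def mse(w, b):
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]   # y_hat - y for every point
    return sum(e ** 2 for e in errors) / len(xs)

print(round(mse(w=1.0, b=1.0), 4))   # 0.8   (Line 1, the bad guess)
print(round(mse(w=0.6, b=2.5), 4))   # 0.57  (Line 2, the better guess)
```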
153 |
154 |
155 |
156 | ## **Section 4: The Training: Finding the Best Line with Gradient Descent**
157 |
158 | **The Big Picture:** We now have a model (`y = w*x + b`) and a way to score it (MSE). The final piece is the process for finding the specific `w` and `b` that result in the lowest possible MSE score. This process is called **training**, and the most common algorithm for it is **Gradient Descent**.
159 |
160 | **The Intuition: Walking Down a Mountain in the Fog**
161 |
162 | Imagine the loss function as a giant, hilly landscape. Every possible combination of `w` and `b` is a location on this landscape, and the altitude at that location is the MSE score. Our goal is to find the bottom of the lowest valley (the minimum MSE).
163 |
164 | The problem is, we're in a thick fog. We can't see the whole landscape. All we can do is feel the slope of the ground right where we're standing.
165 |
166 | Gradient Descent is a simple strategy:
167 | 1. **Check the slope:** Feel which direction is steepest downhill. In math, this slope is called the **gradient**.
168 | 2. **Take a small step:** Take one step in that downhill direction.
169 | 3. **Repeat:** From your new position, repeat the process.
170 |
171 | By taking many small steps, you will eventually walk down the hill and settle at the bottom of the valley.
172 |
173 | *(For a deep dive on the calculus behind calculating the gradient, check out our video: [link to hypothetical video on derivatives and partial derivatives])*
174 |
175 | **Formalizing the Algorithm**
176 |
177 | This "walking" process translates into a simple update rule for our parameters, `w` and `b`:
178 |
179 | * **Update Rule for `w`:** `w_new = w_old - learning_rate * gradient_w`
180 | * **Update Rule for `b`:** `b_new = b_old - learning_rate * gradient_b`
181 |
182 | Two new terms here:
183 | * **`learning_rate` ($\alpha$):** This controls the size of our downhill step. If it's too big, we might overshoot the valley. If it's too small, it could take forever to get to the bottom. It's a hyperparameter you choose.
184 | * **`gradient_w` and `gradient_b`:** These are the calculated slopes for `w` and `b`. They tell us how a small change in `w` or `b` will affect the MSE. The formulas for these gradients, derived from the MSE function, are:
185 | * $gradient_w = \frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \cdot x_i$
186 | * $gradient_b = \frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$
187 |
188 | **A Concrete Walkthrough: One Step of Gradient Descent**
189 |
190 | Let's perform a single training step.
191 |
192 | * **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]`
193 | * **Hyperparameter:** Let's choose a `learning_rate` of `0.01`.
194 | * **Step 0: Initialize.** We start with a random guess. Let's begin at `w = 0.0` and `b = 0.0`. This is our "before" state.
195 |
196 | The MSE for this initial line ($\hat{y} = 0$) is high:
197 | $MSE_{before} = \frac{(0-2)^2 + (0-4)^2 + (0-5)^2 + (0-4)^2 + (0-5)^2}{5} = \frac{4+16+25+16+25}{5} = 17.2$
198 |
199 | * **Step 1: Calculate the Gradients.** We use our formulas and data to find the slope at our current position (`w=0, b=0`).
200 |
201 | | x | y | $\hat{y} = 0x+0$ | Error ($\hat{y}-y$) | Error * x |
202 | |:-:|:-:|:----------------:|:-------------------:|:---------:|
203 | | 1 | 2 | 0 | -2 | -2 |
204 | | 2 | 4 | 0 | -4 | -8 |
205 | | 3 | 5 | 0 | -5 | -15 |
206 | | 4 | 4 | 0 | -4 | -16 |
207 | | 5 | 5 | 0 | -5 | -25 |
208 | | | | | **Sum = -20** | **Sum = -66** |
209 |
210 | Now, plug the sums into the gradient formulas (`n=5`):
211 | * `gradient_w` = (2 / 5) * (-66) = **-26.4**
212 | * `gradient_b` = (2 / 5) * (-20) = **-8.0**
213 |
214 | These gradients tell us the direction of steepest *ascent*. To go downhill, we move in the opposite direction.
215 |
216 | * **Step 2: Update the Parameters.** We use our update rule to take one small step.
217 |
218 | * `w_new = w_old - learning_rate * gradient_w`
219 | `w_new = 0.0 - 0.01 * (-26.4) = 0.264`
220 |
221 | * `b_new = b_old - learning_rate * gradient_b`
222 | `b_new = 0.0 - 0.01 * (-8.0) = 0.08`
223 |
224 | **The Result: Before and After**
225 |
226 | * **Before (Step 0):** `w = 0.0`, `b = 0.0`, `MSE = 17.2`
227 | * **After (Step 1):** `w = 0.264`, `b = 0.08`, `MSE ≈ 10.49` (calculated by plugging the new `w` and `b` into the MSE formula)
228 |
229 | As you can see, after just **one** step, our line is already significantly better: the MSE has dropped from 17.2 to about 10.49. The process simply repeats this exact calculation many times (`epochs`), with `w` and `b` getting closer to the optimal values with every step.
230 |
231 | **Convergence Over Multiple Steps**
232 |
233 | If we restart from the same initial guess (`w = 0`, `b = 0`) and keep iterating, this time with a learning rate of `0.05`, the table below shows how `w` and `b` gradually converge to the optimal values (w = 0.6, b = 2.2):
234 |
235 | | Step | w | b | MSE |
236 | |-----:|------:|------:|------:|
237 | | 0 | 0.0000 | 0.0000 | 17.2000 |
238 | | 1 | 1.3200 | 0.4000 | 1.6464 |
239 | | 2 | 1.0680 | 0.3640 | 1.1047 |
240 | | 5 | 1.0807 | 0.4655 | 1.0276 |
241 | | 10 | 1.0412 | 0.6072 | 0.9418 |
242 | | 20 | 0.9720 | 0.8569 | 0.8084 |
243 | | 50 | 0.8231 | 1.3946 | 0.5981 |
244 | | 100 | 0.6951 | 1.8566 | 0.5015 |
245 | | 200 | 0.6173 | 2.1376 | 0.4807 |
246 | | 500 | 0.6001 | 2.1996 | 0.4800 |
247 | | 1000 | 0.6000 | 2.2000 | 0.4800 |
248 | | **Optimal** | **0.6000** | **2.2000** | **0.4800** |
249 |
250 | Notice how the MSE drops dramatically in the first step, then gradually refines. By step 1000, we've essentially converged to the optimal solution that the Normal Equation gives us instantly.
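
The whole table can be reproduced with a short training loop. This is a minimal sketch; the variable names mirror the formulas above and the learning rate is the `0.05` used in the table.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
w, b, lr, n = 0.0, 0.0, 0.05, len(xs)

for step in range(1000):
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]        # y_hat - y
    grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    grad_b = (2 / n) * sum(errors)
    w -= lr * grad_w
    b -= lr * grad_b

mse = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n
print(round(w, 4), round(b, 4), round(mse, 4))   # ~0.6  ~2.2  ~0.48
```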
251 |
252 | ## **Section 5: An Alternative: The Normal Equation**
253 |
254 | **The Big Picture:** For the specific problem of linear regression, there's a powerful shortcut. Instead of taking thousands of small steps with Gradient Descent, we can use a direct formula to solve for the optimal `w` and `b` in one single calculation. This is called the **Normal Equation**.
255 |
256 | It's the mathematical equivalent of seeing the entire loss landscape from above and simply pointing to the lowest point, rather than feeling your way down in the fog.
257 |
258 | **The Derivation (Briefly)**
259 |
260 | The intuition comes from basic calculus: the minimum of a function is where its slope (derivative) is zero. The Normal Equation is what you get when you:
261 | 1. Write the MSE loss function using matrix notation.
262 | 2. Take the derivative with respect to your parameters (`w` and `b`).
263 | 3. Set that derivative to zero.
264 | 4. Solve for the parameters.
265 |
266 | The resulting formula is:
267 | $\theta = (X^T X)^{-1} X^T y$
268 |
269 | * $\theta$ (theta): A vector containing all our model parameters. In our case, $\theta = \begin{bmatrix} b \\ w \end{bmatrix}$.
270 | * $X$: The **design matrix**, which is our input data `xs` with an extra column of ones added for the bias term.
271 | * $y$: The vector of our actual output values.
272 |
273 | **Applying the Normal Equation to Our Example**
274 |
275 | Let's solve for the optimal `w` and `b` for our student data in one go.
276 |
277 | * **Data:** `xs = [1, 2, 3, 4, 5]`, `ys = [2, 4, 5, 4, 5]`
278 |
279 | **Step 1: Construct the matrix `X` and vector `y`**
280 |
281 | We need to add a column of ones to our `xs` to account for the bias term `b`. This is a crucial step.
282 |
283 | $$
284 | X =
285 | \begin{bmatrix}
286 | 1 & 1 \\
287 | 1 & 2 \\
288 | 1 & 3 \\
289 | 1 & 4 \\
290 | 1 & 5
291 | \end{bmatrix}
292 | ,\quad
293 | y =
294 | \begin{bmatrix}
295 | 2 \\
296 | 4 \\
297 | 5 \\
298 | 4 \\
299 | 5
300 | \end{bmatrix}
301 | $$
302 |
303 | **Step 2: Calculate $X^T X$**
304 |
305 | $$
306 | X^T =
307 | \begin{bmatrix}
308 | 1 & 1 & 1 & 1 & 1 \\
309 | 1 & 2 & 3 & 4 & 5
310 | \end{bmatrix}
311 | $$
312 |
313 | $$
314 | X^T X =
315 | \begin{bmatrix}
316 | 1 & 1 & 1 & 1 & 1 \\
317 | 1 & 2 & 3 & 4 & 5
318 | \end{bmatrix}
319 | \begin{bmatrix}
320 | 1 & 1 \\
321 | 1 & 2 \\
322 | 1 & 3 \\
323 | 1 & 4 \\
324 | 1 & 5
325 | \end{bmatrix}
326 | =
327 | \begin{bmatrix}
328 | 5 & 15 \\
329 | 15 & 55
330 | \end{bmatrix}
331 | $$
332 |
333 | **Step 3: Calculate the inverse, $(X^T X)^{-1}$**
334 |
335 | For a 2x2 matrix $\begin{bmatrix} a & b \\ c & d \end{bmatrix}$, the inverse is $\frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$.
336 |
337 | Determinant ($ad-bc$) = (5 * 55) - (15 * 15) = 275 - 225 = 50.
338 |
339 | $$(X^T X)^{-1} = \frac{1}{50}
340 | \begin{bmatrix}
341 | 55 & -15 \\
342 | -15 & 5
343 | \end{bmatrix}
344 | =
345 | \begin{bmatrix}
346 | 1.1 & -0.3 \\
347 | -0.3 & 0.1
348 | \end{bmatrix}
349 | $$
350 |
351 | **Step 4: Calculate $X^T y$**
352 |
353 | $$
354 | X^T y =
355 | \begin{bmatrix}
356 | 1 & 1 & 1 & 1 & 1 \\
357 | 1 & 2 & 3 & 4 & 5
358 | \end{bmatrix}
359 | \begin{bmatrix}
360 | 2 \\
361 | 4 \\
362 | 5 \\
363 | 4 \\
364 | 5
365 | \end{bmatrix}
366 | =
367 | \begin{bmatrix}
368 | 2+4+5+4+5 \\
369 | 2+8+15+16+25
370 | \end{bmatrix}
371 | =
372 | \begin{bmatrix}
373 | 20 \\
374 | 66
375 | \end{bmatrix}
376 | $$
377 |
378 | **Step 5: Calculate the final result, $\theta = (X^T X)^{-1} X^T y$**
379 |
380 | $$
381 | \theta =
382 | \begin{bmatrix}
383 | 1.1 & -0.3 \\
384 | -0.3 & 0.1
385 | \end{bmatrix}
386 | \begin{bmatrix}
387 | 20 \\
388 | 66
389 | \end{bmatrix}
390 | =
391 | \begin{bmatrix}
392 | (1.1 * 20) + (-0.3 * 66) \\
393 | (-0.3 * 20) + (0.1 * 66)
394 | \end{bmatrix}
395 | =
396 | \begin{bmatrix}
397 | 22 - 19.8 \\
398 | -6 + 6.6
399 | \end{bmatrix}
400 | =
401 | \begin{bmatrix}
402 | 2.2 \\
403 | 0.6
404 | \end{bmatrix}
405 | $$
406 |
407 | **The Result:**
408 |
409 | The Normal Equation gives us the exact optimal parameters in one calculation:
410 | * $b = 2.2$
411 | * $w = 0.6$
412 |
413 | The best fit line for our data is $\hat{y} = 0.6 \cdot x + 2.2$. This is the mathematical bottom of the loss valley that Gradient Descent was slowly stepping towards.
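
If you want to verify this numerically, a linear-algebra library does Steps 1 through 5 in one line. This is a small NumPy sketch; for larger problems, `np.linalg.lstsq` (or `np.linalg.solve`) is numerically safer than forming the inverse explicitly.

```python
import numpy as np

# Normal equation: theta = (X^T X)^-1 X^T y
xs = np.array([1, 2, 3, 4, 5], dtype=float)
ys = np.array([2, 4, 5, 4, 5], dtype=float)

X = np.column_stack([np.ones_like(xs), xs])   # design matrix: column of ones, then x
theta = np.linalg.inv(X.T @ X) @ X.T @ ys

print(theta)   # [2.2 0.6]  ->  b = 2.2, w = 0.6
```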
414 |
415 | **Practical Tradeoffs**
416 |
417 | | Feature | Gradient Descent | Normal Equation |
418 | | :-------------------- | :--------------------------------------------- | :--------------------------------------------- |
419 | | **Process** | Iterative, takes many small steps. | Direct, one-shot calculation. |
420 | | **Scalability** | Works well with huge datasets (millions of features). | Computationally expensive for many features (inverting a large matrix is slow). |
421 | | **Learning Rate** | Requires choosing a learning rate, $\alpha$. | No hyperparameters to tune. |
422 | | **When to Use** | The default for most large-scale ML problems. | Excellent for smaller datasets where the number of features is not too large (e.g., < 10,000). |
423 |
424 | ## **Section 6: Conclusion: From Our Simple Model to the Real World**
425 |
426 | In the last 20 minutes, we have built a complete machine learning model from the ground up. Let's retrace our steps.
427 |
428 | We started with a simple dataset and a clear goal: predict an exam score from hours studied. To achieve this, we assembled a three-part engine:
429 |
430 | 1. **The Model (Hypothesis Function):** We defined a straight line as our hypothesis for how the data behaves.
431 | * **Formula:** $\hat{y} = w \cdot x + b$
432 |
433 | 2. **The Loss Function (Cost Function):** We chose Mean Squared Error (MSE) to give us a single, precise number that quantifies how "wrong" our model's predictions are.
434 | * **Formula:** $J(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$
435 |
436 | 3. **The Optimizer (Gradient Descent):** We used Gradient Descent to iteratively step towards the best `w` and `b` that minimize the MSE.
437 | * **Update Rule:**
438 | $w \leftarrow w - \alpha \frac{\partial J}{\partial w}$
439 | $b \leftarrow b - \alpha \frac{\partial J}{\partial b}$
440 |
441 | 4. **The Analytical Solution (Normal Equation):** We also saw how the Normal Equation can solve for the optimal parameters directly in one calculation.
442 | * **Formula:** $\theta = (X^T X)^{-1} X^T y$
443 |
444 | These components—**model, loss, and optimizer**—are the fundamental building blocks of most supervised machine learning.
445 |
446 | You now understand the engine that powers a significant portion of data science and machine learning. You've built the foundation.
--------------------------------------------------------------------------------
/docs/stats/hypothesis.md:
--------------------------------------------------------------------------------
1 | # Practical Hypothesis Testing: From P-Values to Decision Rules
2 |
3 | ## The Promise
4 | By the end of this tutorial, you will master the algorithmic decision engine used in A/B testing, clinical trials, and algorithmic trading. You will move beyond "guessing" to a formalized binary decision process based on probability density.
5 |
6 | You will learn to manipulate these variables:
7 | * **$H_0$ (Null Hypothesis):** The default state of the world.
8 | * **$H_1$ (Alternative Hypothesis):** The claim you are testing.
9 | * **$\alpha$ (Alpha):** The significance level (usually 0.05).
10 | * **$Z$ (Test Statistic):** The standardized distance from the mean.
11 | * **$P$ (P-Value):** The probability of observing the data assuming $H_0$ is true.
12 |
13 | ### The Universal Algorithm
14 | Hypothesis testing is not magic. It is a 4-step function.
15 |
16 | $$
17 | \text{Decision} = f(\text{Data}, \text{Threshold})
18 | $$
19 |
20 | **The 4-Step Process:**
21 | 1. **Formulate:** Define $H_0$ (Status Quo) and $H_1$ (New Theory).
22 | 2. **Calibrate:** Set $\alpha$ (Error tolerance, e.g., 5%).
23 | 3. **Compute:** Calculate the test statistic using the relevant formula.
24 | * *Example (Z-Test):* $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
25 | 4. **Decide:**
26 | * If $P \text{-value} < \alpha$: **Reject $H_0$**. (Significant)
27 | * If $P \text{-value} \ge \alpha$: **Fail to Reject $H_0$**. (Not Significant)
28 |
29 | ### The Landscape of Tests
30 | While the logic above remains constant, the **Step 3** formula changes based on data type and volume. You do not need to memorize these now, but acknowledge the hierarchy:
31 |
32 | | Data Type | Sample Size | Comparisons | Test Name |
33 | | :--- | :--- | :--- | :--- |
34 | | **Continuous** (e.g., Height, Price) | Large ($n > 30$) | Mean vs. Target | **Z-Test** |
35 | | **Continuous** | Small ($n < 30$) | Mean vs. Target | **T-Test** |
36 | | **Continuous** | Any | 3+ Groups | **ANOVA** |
37 | | **Categorical** (e.g., Click/No Click) | Any | Frequency counts | **Chi-Square** |
38 |
39 | This tutorial focuses on the **Z-Test** to build the core intuition, which applies to all other tests.
40 |
41 | ## Phase 1: Intuition First - The Suspicious Coin
42 |
43 | We start with a concrete problem to build the mental model.
44 |
45 | **The Problem:**
46 | A friend hands you a coin and claims it is "fair" (50/50 chance). You bet on the next 10 flips.
47 | You flip the coin 10 times.
48 | **Result:** 9 Heads, 1 Tail.
49 |
50 | **The Question:**
51 | Is your friend cheating, or did you just get lucky?
52 |
53 | **The Concrete Solution (The "BS Detector"):**
54 | We cannot strictly *prove* the coin is rigged. Instead, we assume the coin is **fair** and calculate the probability of observing this result.
55 |
56 | If the probability is too low, we call "BS" (Bullshit).
57 |
58 | **Step 1: The Math**
59 | If the coin is fair ($P_{head} = 0.5$), the probability of getting *exactly* 10 Heads is:
60 | $$0.5^{10} \approx 0.00097$$
61 |
62 | The probability of getting *exactly* 9 Heads is:
63 | $$10 \times 0.5^{10} \approx 0.00976$$
64 |
65 | The probability of getting **at least** 9 heads (9 or 10) is the sum:
66 | $$0.00097 + 0.00976 \approx 0.0107$$
67 |
68 | **Result:** There is a **1.07%** chance a fair coin does this.
69 |
70 | **Step 2: The Decision Rule**
71 | Before the game, you have an internal "BS Threshold." Usually, if something has less than a 5% chance of happening by accident, we conclude it wasn't an accident.
72 |
73 | * **Observation:** 1.07% chance.
74 | * **Threshold:** 5%.
75 | * **Logic:** $1.07\% < 5\%$.
76 | * **Conclusion:** The event is too rare to be luck. The coin is likely rigged.
77 |
78 | ### Abstracting the Solution
79 |
80 | Now we map this specific scenario to the formal statistical terms used in the industry.
81 |
82 | | Concrete Coin Scenario | Formal Statistical Term | Symbol |
83 | | :--- | :--- | :--- |
84 | | "The coin is fair." | **Null Hypothesis** | $H_0$ |
85 | | "The coin is rigged." | **Alternative Hypothesis** | $H_1$ |
86 | | 1.07% (Calculated chance of luck) | **P-Value** | $p$ |
87 | | 5% (Your BS Threshold) | **Significance Level** | $\alpha$ |
88 |
89 | **The Logic Visualized:**
90 |
91 | ```text
92 | Shape: 3D Bar Chart
93 | X-axis: Number of Heads (0 to 10)
94 | Y-axis: Probability
95 |
96 | Description:
97 | - The bars form a bell-curve shape centered at 5.
98 | - The bar at 5 is the tallest (most likely).
99 | - The bars at 9 and 10 are tiny, red slivers at the far right edge.
100 | - An arrow points to 9 and 10 labeled: "The Rejection Region (P-Value)"
101 | ```
102 |
103 | ### Code Snippet: Calculating the P-Value
104 |
105 | In practice, you don't do the math by hand. You use a library.
106 |
107 | **Input:**
108 | * Successes (Heads): 9
109 | * Trials (Flips): 10
110 | * Expected Probability ($H_0$): 0.5
111 |
112 | ```python
113 | from scipy.stats import binomtest
114 |
115 | # Inputs
116 | k = 9 # Number of heads
117 | n = 10 # Number of flips
118 | p = 0.5 # Null hypothesis probability
119 |
120 | # Calculate P-Value
121 | result = binomtest(k, n, p, alternative='greater')
122 |
123 | print(f"P-Value: {result.pvalue:.4f}")
124 | ```
125 |
126 | **Output:**
127 | ```text
128 | P-Value: 0.0107
129 | ```
130 |
131 | **Summary of Phase 1:**
132 | We rejected the "Fair Coin" hypothesis because the p-value (0.0107) was lower than our alpha (0.05). We formally decided: **Reject $H_0$**.
133 |
134 | ## Phase 2: The Yardstick (Normal Distribution & Z-Test)
135 |
136 | **Context in Big Picture:** In Phase 1, we counted discrete events (Heads/Tails). But most business data is **continuous** (Revenue, Latency, Weight). You cannot calculate "probability of exactly \$100.05 revenue" because the possibilities are infinite.
137 |
138 | We need a new yardstick. We rely on the **Normal Distribution** and the **Z-Score**.
139 |
140 | ### 1. The Concrete Problem
141 | You manage a chip factory.
142 | * **Target Weight:** 100g.
143 | * **Known Variation ($\sigma$):** The machine historically deviates by 5g (Standard Deviation).
144 |
145 | You take a sample of **50 chips**.
146 | * **Sample Average ($\bar{X}$):** 102g.
147 |
148 | **Question:** Is the machine broken (drifting to 102g)? Or is this random noise?
149 |
150 | ### 2. The Tool: Central Limit Theorem (CLT)
151 | Why can we solve this?
152 | **The CLT Promise:** No matter how weird or messy your individual data points are, if you take enough samples and average them, the **distribution of those averages** will form a perfect Bell Curve (Normal Distribution).
153 |
154 | Because of this, we don't need to know the shape of the raw data. We only analyze the **Sample Mean**.
155 |
156 | ### 3. The Formula: The Z-Score
157 | The Z-Score converts your specific problem (grams, dollars, seconds) into a universal "Standard Deviation units."
158 |
159 | $$Z = \frac{\text{Signal}}{\text{Noise}} = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}$$
160 |
161 | * $\bar{X}$: Sample Mean (102g)
162 | * $\mu$: Target Population Mean (100g)
163 | * $\sigma$: Population Standard Deviation (5g)
164 | * $n$: Sample Size (50)
165 | * $\sigma / \sqrt{n}$: **Standard Error** (The volatility of the *average*, not the individual).
166 |
167 | ### 4. Step-by-Step Execution
168 | Let's run the Master Algorithm on the factory data.
169 |
170 | 1. **Define $H_0$:** $\mu = 100$ (Machine is fine).
171 | 2. **Collect Data:** $\bar{X} = 102$, $n = 50$.
172 | 3. **Calculate Z-Score:**
173 | * Signal: $102 - 100 = 2$
174 | * Noise (Standard Error): $5 / \sqrt{50} \approx 0.707$
175 | * $Z = 2 / 0.707 \approx \mathbf{2.83}$
176 | 4. **Compute P-Value:**
177 | * How rare is a Z-score of 2.83?
178 | * In a standard curve, 95% of data is between -1.96 and +1.96.
179 | * 2.83 is way outside.
180 | * $P(Z > 2.83) \approx 0.0023$ (0.23%).
181 | 5. **Decision:**
182 | * $0.0023 < 0.05$. **Reject $H_0$**.
183 | * **Reality Check:** The machine is broken. Stop the line.
184 |
185 | ### 5. Code Example (Z-Test)
186 | Calculating tail probabilities manually is hard. Use Python.
187 |
188 | ```python
189 | import numpy as np
190 | from scipy.stats import norm
191 |
192 | # Inputs
193 | mu_target = 100
194 | sigma = 5
195 | n = 50
196 | x_bar = 102
197 |
198 | # Calculate Z
199 | std_error = sigma / np.sqrt(n)
200 | z_score = (x_bar - mu_target) / std_error
201 |
202 | # Calculate P-value (Two-tailed)
203 | # We multiply by 2 because we care if it's too high OR too low
204 | p_value = 2 * (1 - norm.cdf(abs(z_score)))
205 |
206 | print(f"Z-Score: {z_score:.2f}")
207 | print(f"P-Value: {p_value:.4f}")
208 | ```
209 |
210 | **Output:**
211 | ```text
212 | Z-Score: 2.83
213 | P-Value: 0.0047
214 | ```
215 |
216 | ### 6. Visualizing the Z-Score
217 | The Z-score maps your data onto the Standard Normal Curve (Mean=0, SD=1).
218 |
219 | ```
220 | Standard Normal Distribution (The Yardstick)
221 |
222 | ^
223 | | Center (0)
224 | | |
225 | | __|__
226 | | _ / \ _
227 | | / \
228 | | / \
229 | | __|___________________|__ [Rejection Zone]
230 | ______|_|_|___________________|_|___X____
231 | Z: -2 -1 0 1 2 ^
232 | |
233 | 2.83 (Observed)
234 |
235 | The observed Z (2.83) falls into the rejection tail.
236 | The probability of landing here by luck is tiny.
237 | ```
238 |
239 | ### 7. Connect to Reality
240 | We just determined that a 2g difference is huge. Why? Because we had a sample size of 50.
241 | **Key Insight:** If $n$ was only 5, the Standard Error ($\sigma/\sqrt{n}$) would be larger, the Z-score would be smaller, and we might *fail* to reject.
242 | **More Data = More Sensitivity to small differences.**
243 |
244 | ## Phase 3: Handling Unknowns (Student’s T-Distribution)
245 |
246 | **Context in Big Picture:** Phase 2 (Z-Test) required you to know the global Standard Deviation ($\sigma$). In the real world, you almost never know this. You only have your small sample. When you rely on sample data to estimate variance, you add uncertainty. The **T-Test** accounts for this.
247 |
248 | ### 1. The Concrete Problem (A/B Testing)
249 | You are testing a new website design.
250 | * **Group A (Control):** 10 users. Average time on site = 45s.
251 | * **Group B (Treatment):** 10 users. Average time on site = 55s.
252 |
253 | **Question:** Is the 10s improvement real, or just random noise from small sample sizes?
254 | **Constraint:** You do not know the standard deviation of the entire internet population. You only have the variation inside your two groups ($s$).
255 |
256 | ### 2. The Solution: Student’s T-Distribution
257 | Because we are estimating the variance using the sample ($s$) instead of the population ($\sigma$), our predictions are less precise.
258 |
259 | If we used the Normal Distribution (Z), we would be overconfident.
260 | **The Fix:** The **T-Distribution**. It looks like a Normal Distribution but with a lower peak and **fatter tails**. The fat tails represent the "penalty" for not knowing the true population variance.
261 |
262 | ### 3. The Formula: T-Statistic (Two-Sample)
263 | We compare the difference between two groups relative to their combined volatility.
264 |
265 | $$t = \frac{\text{Difference in Means}}{\text{Combined Noise}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
266 |
267 | * $\bar{X}_1 - \bar{X}_2$: The Signal (10s difference).
268 | * $s_1^2, s_2^2$: Sample Variances (The noise inside each group).
269 | * $n_1, n_2$: Sample sizes.
270 |
271 | ### 4. Step-by-Step Execution
272 | 1. **Define $H_0$:** $\mu_A = \mu_B$ (The new design changes nothing).
273 | 2. **Collect Data:**
274 | * Group A: Mean=45, SD=15, n=10.
275 | * Group B: Mean=55, SD=15, n=10.
276 | 3. **Calculate T-Score:**
277 | * Signal: $55 - 45 = 10$.
278 | * Noise: $\sqrt{\frac{225}{10} + \frac{225}{10}} = \sqrt{45} \approx 6.7$.
279 | * $t = 10 / 6.7 \approx \mathbf{1.49}$.
280 | 4. **Compute P-Value:**
281 | * We look up $t=1.49$ with "degrees of freedom" ($n_1 + n_2 - 2 = 18$).
282 | * P-value $\approx 0.15$.
283 | 5. **Decision:**
284 | * $0.15 > 0.05$. **Fail to Reject $H_0$**.
285 | * **Conclusion:** The 10s increase is likely noise. You do not have enough evidence to claim the new design works.
286 |
287 | ### 5. Code Example (T-Test)
288 | This is the most common statistical function you will use in Python.
289 |
290 | ```python
291 | from scipy import stats
292 | import numpy as np
293 |
294 | # Fake data representing time on site (seconds)
295 | group_a = np.array([40, 50, 45, 30, 60, 42, 48, 35, 55, 45]) # Control
296 | group_b = np.array([50, 60, 55, 40, 70, 52, 58, 45, 65, 55]) # Treatment
297 |
298 | # Run Independent Two-sample T-test
299 | t_stat, p_val = stats.ttest_ind(group_b, group_a)
300 |
301 | print(f"T-Statistic: {t_stat:.2f}")
302 | print(f"P-Value: {p_val:.4f}")
303 | ```
304 |
305 | **Output:**
306 | ```text
307 | T-Statistic: 2.50
308 | P-Value: 0.0221
309 | ```
310 | *(Note: In this specific code snippet, the data was cleaner, resulting in a significant P-value).*
311 |
312 | ### 6. Visualizing T vs Normal
313 | The T-distribution changes shape based on Sample Size ($n$).
314 |
315 | ```
316 | Comparison of Curves
317 |
318 | ^
319 | | _..-'''-.._ <-- Normal Distribution (Z)
320 | | .' | '. (High Peak, Thin Tails)
321 | | / _..-|-.._ \
322 | | / .' | '. \ <-- T-Distribution (n=5)
323 | | | / | \ | (Lower Peak, Fatter Tails)
324 | ______|_\_|_______|_______|_/____
325 | -3 0 3
326 |
327 | The Fatter Tail means you need a HIGHER score to prove significance.
328 | The T-test is "conservative." It demands stronger evidence
329 | because your sample size is small.
330 | ```
331 |
332 | ### 7. Reality Check
333 | We just failed to reject the null in the manual example ($p=0.15$), even though the average improved by 10 seconds.
334 | **The Lesson:** Big impact doesn't matter if your variance ($s$) is high or your sample size ($n$) is low. To pass the T-test, you either need a **massive improvement** or **more data** to shrink the variance.
335 |
336 | ## Phase 4: Categorical Differences (Chi-Square Test)
337 |
338 | **Context in Big Picture:** The Z-test and T-test work for **averages** of continuous numbers (Height, Money, Time). But what if your data is **categorical**?
339 | * Did the user Click or Not Click?
340 | * Did the customer Buy or Not Buy?
341 |
342 | You cannot calculate the "average" of a Click. You can only count frequencies. For this, we use the **Chi-Square ($\chi^2$) Test**.
343 |
344 | ### 1. The Concrete Problem (Conversion Rate)
345 | You run an E-commerce site. You test a new "Buy" button color.
346 | * **Control (Blue):** 100 visitors, 10 bought. (10% rate)
347 | * **Variant (Red):** 100 visitors, 20 bought. (20% rate)
348 |
349 | **Question:** Is the Red button actually better, or did you just get lucky with those specific 100 visitors?
350 |
351 | ### 2. The Logic: Observed vs. Expected
352 | The logic shifts from "Comparing Means" to "Comparing Tables."
353 |
354 | * **Observed ($O$):** What actually happened.
355 | * **Expected ($E$):** What *should* have happened if the button color didn't matter ($H_0$).
356 |
357 | We measure the logical "distance" between $O$ and $E$. If the distance is large, $H_0$ is false.
358 |
359 | ### 3. The Formula: Chi-Square Statistic
360 | $$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$
361 |
362 | * $O_i$: Observed count in a cell.
363 | * $E_i$: Expected count in a cell.
364 | * We square the difference to get rid of negatives.
365 | * We divide by expected to normalize (an error of 5 is huge if expected is 10, but tiny if expected is 10,000).
366 |
367 | ### 4. Step-by-Step Execution
368 | 1. **Define $H_0$:** Color has no effect on sales.
369 | 2. **Build Observed Table:**
370 | * Total Visitors: 200.
371 | * Total Sales: 30.
372 | * Global Conversion Rate: $30/200 = 15\%$.
373 |
374 | | Group | Bought | Didn't Buy | Total |
375 | | :--- | :--- | :--- | :--- |
376 | | **Blue** | 10 ($O_1$) | 90 | 100 |
377 | | **Red** | 20 ($O_2$) | 80 | 100 |
378 |
379 | 3. **Calculate Expected Table ($H_0$ True):**
380 | If $H_0$ is true, both groups should convert at the Global Rate (15%).
381 | * Blue Expected Buys: $100 \times 0.15 = 15$ ($E_1$).
382 | * Red Expected Buys: $100 \times 0.15 = 15$ ($E_2$).
383 |
384 | 4. **Calculate $\chi^2$:**
385 | * Blue Buys: $(10 - 15)^2 / 15 = 25/15 = 1.66$
386 | * Red Buys: $(20 - 15)^2 / 15 = 25/15 = 1.66$
387 | * *Do the same for "Didn't Buy" cells...*
388 | * $\chi^2 \approx 3.92$.
389 |
390 | 5. **Compute P-Value:**
391 | * Degrees of Freedom = 1.
392 | * Look up $\chi^2 = 3.92$.
393 | * P-value $\approx 0.047$.
394 |
395 | 6. **Decision:**
396 | * $0.047 < 0.05$. **Reject $H_0$**.
397 | * **Result:** The Red button really works.
398 |
399 | ### 5. Code Example (Chi-Square)
400 | We do not construct expected tables by hand. Python does it automatically.
401 |
402 | ```python
403 | from scipy.stats import chi2_contingency
404 |
405 | # Input: The Observed Table
406 | # [[Bought, No Buy] (Control), [Bought, No Buy] (Variant)]
407 | data = [[10, 90],
408 | [20, 80]]
409 |
410 | chi2, p, dof, expected = chi2_contingency(data, correction=False)  # no Yates correction, to match the hand calculation
411 |
412 | print(f"Chi2 Stat: {chi2:.2f}")
413 | print(f"P-Value: {p:.4f}")
414 | ```
415 |
416 | **Output:**
417 | ```text
418 | Chi2 Stat: 3.92
419 | P-Value: 0.0477
420 | ```
421 |
422 | ### 6. Visualizing the Logic
423 | Think of this as fitting a grid over reality.
424 |
425 | ```
426 | Expected Reality (H0) Observed Reality (Data)
427 | _______________________ _______________________
428 | | | | | | |
429 | | 15 | 85 | | 10 | 90 | <--- Control
430 | |__________|____________| |__________|____________|
431 | | | | | | |
432 | | 15 | 85 | | 20 | 80 | <--- Variant
433 | |__________|____________| |__________|____________|
434 |
435 | ^ ^
436 | |____________________________________|
437 | Measure
438 | Difference (Chi^2)
439 | ```
440 |
441 | ### 7. Reality Check: Sample Size Sensitivity
442 | We rejected $H_0$ with a P-value of 0.047. This is barely significant.
443 | **The Reality:** If the Red button had 19 sales instead of 20, the P-value would jump to > 0.05, and we would conclude "No Difference."
444 | In categorical data (Clicks/Conversions), you need **massive sample sizes** to detect small differences because the data is "low resolution" (just Yes/No, not a precise number).
445 |
446 | ## Phase 5: Production Reality
447 |
448 | **Context in Big Picture:** You now know the math that runs inside the black boxes of tools like Optimizely, Google Optimize, or Tableau. In the real world, you rarely calculate these by hand. Your job is to **select the right test** and **interpret the output correctly**.
449 |
450 | ### 1. The "God View": Choosing the Right Test
451 | You don't need to memorize formulas. You need to memorize this decision matrix. Look at your data, identify the type, and pick the algorithm.
452 |
453 | | Data Type | Question Example | The Variable | The Test |
454 | | :--- | :--- | :--- | :--- |
455 | | **Categorical** | Did they click? (Yes/No) | Conversion Rate | **Chi-Square Test** |
456 | | **Continuous** | Did they spend more? ($) | Average Revenue (Unknown $\sigma$) | **Two-Sample T-Test** |
457 | | **Continuous** | Is the machine calibrated? | Average Weight (Known $\sigma$) | **Z-Test** |
458 | | **Binary** | Is the coin rigged? | Success Count | **Binomial Test** |
459 |
460 | ### 2. Production Pitfall: P-Hacking (Peeking)
461 | **The Trap:** You run an A/B test. You check the P-value every hour.
462 | * Hour 1: $p = 0.08$ (Not significant)
463 | * Hour 4: $p = 0.04$ (Significant! Stop!)
464 | * Hour 5: $p = 0.12$ (Not significant)
465 |
466 | **The Reality:** If you check often enough, you *will* eventually see $p < 0.05$ by random chance. This is cheating.
467 | **The Solution:** Calculate your required sample size **before** you start. Run the test until you hit that number. Do not peek.
468 |
469 | ### 3. Production Pitfall: Statistical vs. Practical Significance
470 | **The Scenario:** You test a new font on Google Search.
471 | * Sample Size: 10 Million users.
472 | * Result: Search time improves by 0.0001 seconds.
473 | * P-Value: $< 0.0000001$.
474 |
475 | **The Verdict:**
476 | * **Statistically Significant?** Yes. The math proves the difference is real.
477 | * **Practically Significant?** No. 0.0001 seconds is invisible to a human.
478 | * **Lesson:** Low P-value $\neq$ Big Impact. It just means "High Certainty." Always check the **Effect Size** (the actual magnitude of the difference).
479 |
480 | ### 4. Summary: The 40-Minute Journey
481 | We started with a promise: to learn the universal algorithm of inference.
482 |
483 | 1. **The Framework:** $H_0$, Data, Test Statistic, P-value, Decision.
484 | 2. **The Yardstick:** Standard Error measures how "noisy" the data is.
485 | 3. **The Ratio:** All tests follow the same pattern:
486 | $$ \text{Score} = \frac{\text{Observed Signal}}{\text{Expected Noise}} $$
487 | 4. **The Decision:** If the Signal is much stronger than the Noise (High Score, Low P-value), we accept the new reality.
488 |
489 | **You have now graduated from "guessing" to "inference."**
--------------------------------------------------------------------------------
/docs/rl/multi-armed bandits.md:
--------------------------------------------------------------------------------
1 | ## Reinforcement Learning: Multi-Armed Bandits
2 |
3 | ### Part 1: The Big Idea - What's Your Casino Strategy?
4 |
5 | Before we dive in, let's talk about the big idea that separates Reinforcement Learning (RL) from other types of machine learning.
6 |
7 | Most of the time, when we teach a machine, we give it **instructions**. This is called *supervised learning*. It's like having a math teacher who shows you the correct answer (`5 + 5 = 10`) and tells you to memorize it. The feedback is **instructive**: "This is the right way to do it."
8 |
9 | Reinforcement Learning is different. It learns from **evaluation**. It's like a critic watching you perform. After you take an action, the critic just tells you how good or bad that action was—a score. It doesn't tell you what you *should have* done. The feedback is **evaluative**: "That was a 7/10." This creates a problem: to find the best actions, you must actively search for them yourself.
10 |
11 | This need to search for good behavior is what we're going to explore, using a classic problem that makes it crystal clear.
12 |
13 | #### The k-Armed Bandit: A Casino Dilemma
14 |
15 | Imagine you walk into a casino and see a row of `k` slot machines. In our lingo, this is a **`k`-armed bandit**. Each machine (or "arm") is an **action** you can take.
16 |
17 | * You have a limited number of tokens, say 1000.
18 | * Each machine has a different, hidden probability of paying out a jackpot. This hidden average payout is the machine's true **value**.
19 | * Your goal is simple: **Walk away with the most money possible.**
20 |
21 | How do you play? This isn't just a brain teaser; it’s the perfect analogy for the most important trade-off in reinforcement learning.
22 |
23 | #### Formalizing the Problem (The Simple Math)
24 |
25 | Let's put some labels on our casino game. Don't worry, the math is just a way to be precise.
26 |
27 | * An **action** `a` is the choice of which of the `k` levers to pull.
28 | * The action you choose at time step `t` (e.g., your first pull, `t=1`) is called `A_t`.
29 | * The **reward** you get from that pull is `R_t`.
30 | * Each action `a` has a true mean reward, which we call its **value**, denoted as `q*(a)`.
31 |
32 | This value is the reward we *expect* to get on average from that machine. Formally, it's written as:
33 |
34 | `q*(a) = E[R_t | A_t = a]`
35 |
36 | In plain English, this means: "**The true value of an action `a` is the expected (average) reward you'll get, given you've selected that action.**"
37 |
38 | The catch? **You don't know the true values `q*(a)`!** You have to discover them by playing the game.
39 |
40 | #### The Core Conflict: Exploration vs. Exploitation
41 |
42 | This is where the dilemma hits. With every token you spend, you face a choice:
43 |
44 | 1. **Exploitation:** You've tried a few machines, and one of them seems to be paying out more than the others. Exploitation means you stick with that machine because, based on your current knowledge, it's your best bet to maximize your reward *right now*.
45 |
46 | 2. **Exploration:** You deliberately try a different machine—one that seems worse, or one you haven't even tried yet. Why? Because it *might* be better than your current favorite. You are exploring to improve your knowledge of the world.
47 |
48 | > **The Conflict:** You cannot explore and exploit with the same token. Every time you explore a potentially worse machine, you give up a guaranteed good-ish reward from your current favorite. But if you only ever exploit, you might get stuck on a decent machine, never discovering the true jackpot next to it.
49 |
50 | This is the **exploration-exploitation dilemma**. It is arguably the most important foundational concept in reinforcement learning. Finding a good strategy to balance this trade-off is the key to creating intelligent agents.
51 |
52 | In the next section, we'll look at a simple but flawed strategy for solving this problem.
53 |
54 | ### Part 2: A Simple (But Flawed) Strategy - The "Greedy" Approach
55 |
56 | So, how would most people play the slot machine game? The most straightforward strategy is to be "greedy."
57 |
58 | A greedy strategy works in two phases:
59 | 1. **Estimate:** Keep a running average of the rewards you've gotten from each machine.
60 | 2. **Exploit:** Always pull the lever of the machine that has the highest average so far.
61 |
62 | This sounds reasonable, right? You're using the data you've collected to make the most profitable choice at every step. Let's formalize this.
63 |
64 | #### How to Estimate Action Values
65 |
66 | Since we don't know the true value `q*(a)` of a machine, we have to estimate it. We'll call our estimate at time step `t` **`Q_t(a)`**. The simplest way to do this is the **sample-average method**:
67 |
68 | `Q_t(a) = (sum of rewards when action a was taken before time t) / (number of times action a was taken before time t)`
69 |
70 | This is just a simple average. If you pulled lever 1 three times and got rewards of 5, 7, and 3, your estimated value for lever 1 would be `(5+7+3)/3 = 5`.
71 |
72 | #### The Greedy Action Selection Rule
73 |
74 | The greedy rule is to always select the action with the highest estimated value. We write this as:
75 |
76 | `A_t = argmax_a Q_t(a)`
77 |
78 | The `argmax_a` part looks fancy, but it just means "**find the action `a` that maximizes the value of `Q_t(a)`**." If two machines are tied for the best, you can just pick one of them randomly.
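Here is a minimal sketch of the greedy rule in Python, including the random tie-break (the function name and the Q values are illustrative):

```python
import numpy as np

def greedy_action(Q):
    """Pick the action with the highest estimated value, breaking ties at random."""
    best = np.flatnonzero(Q == Q.max())   # indices of all tied maxima
    return int(np.random.choice(best))

Q = np.array([5.0, 2.5, 5.0])    # two machines tied at 5
print(greedy_action(Q))          # 0 or 2, chosen at random
```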
79 |
80 | #### Why the Greedy Strategy Fails
81 |
82 | The greedy method has a fatal flaw: **it's too quick to judge and never looks back.** It gets stuck on the first "good enough" option it finds, even if it's not the *best* option.
83 |
84 | Let's see this in action with a minimal example.
85 |
86 | **The Setup:**
87 | * A 3-armed bandit problem.
88 | * The true (hidden) values are:
89 | * Machine 1: `q*(1) = 1` (A dud)
90 | * Machine 2: `q*(2) = 5` (Pretty good)
91 | * Machine 3: `q*(3) = 10` (The real jackpot!)
92 | * To start, our agent needs some data, so let's say it tries each machine once.
93 |
94 | **The Game:**
95 | 1. **Pull 1:** The agent tries **Machine 1**. It's a dud, and the reward is `R_1 = 1`.
96 | * Our estimate is now `Q(1) = 1`.
97 |
98 | 2. **Pull 2:** The agent tries **Machine 2**. The reward is a lucky `R_2 = 7`.
99 | * Our estimate is now `Q(2) = 7`.
100 |
101 | 3. **Pull 3:** The agent tries **Machine 3** (the true jackpot). By pure bad luck, this one pull gives a disappointing reward of `R_3 = 4`.
102 | * Our estimate is now `Q(3) = 4`.
103 |
104 | **The Trap:**
105 | After these three pulls, our agent's estimates are:
106 | * `Q(1) = 1`
107 | * `Q(2) = 7`
108 | * `Q(3) = 4`
109 |
110 | Now, the greedy strategy kicks in. From this point forward, which machine will the agent choose? It will always choose the `argmax`, which is **Machine 2**.
111 |
112 | The agent will pull the lever for Machine 2 forever. It will never go back to Machine 3, because based on its one unlucky experience, it "believes" Machine 2 is better. It got stuck exploiting a suboptimal action and will **never discover the true jackpot machine**.
113 |
114 | This is why pure exploitation fails. We need a way to force the agent to keep exploring, just in case its initial estimates were wrong. That brings us to our first real solution.
115 |
116 | ### Part 3: A Smarter Strategy - The ε-Greedy (Epsilon-Greedy) Method
117 |
118 | The greedy strategy failed because it was too stubborn. Once it found a "good enough" option, it never looked back. The ε-Greedy (pronounced "epsilon-greedy") method fixes this with a very simple and clever rule.
119 |
120 | > **The Big Idea:** "Be greedy most of the time, but every once in a while, do something completely random."
121 |
122 | Think of it like choosing a restaurant for dinner.
123 | * **Exploitation (Greed):** You go to your favorite pizza place because you know it's good.
124 | * **Exploration (Randomness):** Once a month, you ignore your favorites and just pick a random restaurant from Google Maps. You might end up at a terrible place, but you might also discover a new favorite!
125 |
126 | This small dose of randomness is the key. It forces the agent to keep exploring all the options, preventing it from getting stuck.
127 |
128 | #### The ε-Greedy Rule
129 |
130 | Here's how it works. We pick a small probability, called **epsilon (ε)**, usually a value like 0.1 (which means 10%).
131 |
132 | At every time step, the agent does the following:
133 | 1. Generate a random number between 0 and 1.
134 | 2. **If the number is greater than ε:** **Exploit**. Choose the action with the highest estimated value, just like the greedy method.
135 | * This happens with probability `1 - ε` (e.g., 90% of the time).
136 | 3. **If the number is less than or equal to ε:** **Explore**. Choose an action completely at random from *all* available actions, with equal probability.
137 | * This happens with probability `ε` (e.g., 10% of the time).
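In code, the whole rule is only a few lines. Here is a minimal sketch (the function name is ours; the Q values are the stuck agent's estimates from Part 2):

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon=0.1):
    """With probability epsilon pick any arm uniformly (explore); otherwise be greedy (exploit)."""
    if np.random.random() <= epsilon:      # explore
        return int(np.random.randint(len(Q)))
    return int(np.argmax(Q))               # exploit

Q = np.array([1.0, 7.0, 4.0])    # the stuck agent's estimates from Part 2
print(epsilon_greedy_action(Q))  # usually 1 (Machine 2), but occasionally 0 or 2
```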
138 |
139 | #### Why ε-Greedy Works
140 |
141 | Let's revisit our "Greedy Trap" from the previous section. Our agent was stuck forever pulling the lever for Machine 2, never realizing Machine 3 was the true jackpot.
142 |
143 | How would an ε-Greedy agent with `ε = 0.1` handle this?
144 |
145 | * **90% of the time**, it would look at its estimates (`Q(1)=1`, `Q(2)=7`, `Q(3)=4`) and greedily choose **Machine 2**.
146 | * **But 10% of the time**, it would ignore its estimates and pick a machine at random. This means it has a chance of picking Machine 1, Machine 2, or **Machine 3**.
147 |
148 | Eventually, that 10% chance will cause it to try **Machine 3** again. And again. And again. As it gets more samples from Machine 3, its estimated value `Q(3)` will slowly climb from that unlucky `4` towards the true value of `10`.
149 |
150 | Once `Q(3)` becomes greater than `Q(2)`, the agent's "greedy" choice will switch! Now, 90% of the time, it will exploit the *correct* jackpot machine.
151 |
152 | #### The Guarantee
153 |
154 | The advantage of this method is huge: in the long run, as the number of plays increases, every single machine will be sampled many, many times. Because of this, the **Law of Large Numbers** tells us that our estimated values `Q_t(a)` will eventually converge to the true values `q*(a)`.
155 |
156 | This guarantees that the agent will eventually figure out which action is best and will select it most of the time. It solves the "getting stuck" problem completely. Now, let's see exactly how this works with a step-by-step example.
157 |
158 | ### Part 4: Let's Play! A Step-by-Step Walkthrough
159 |
160 | Seeing is believing. We're going to simulate a few turns of an ε-Greedy agent to watch its "brain" update.
161 |
162 | #### The Setup
163 |
164 | * **The Game:** A 3-armed bandit problem.
165 | * **The Hidden Truth:** The true average payouts (`q*(a)`) are:
166 | * `q*(1) = 2` (Dud)
167 | * `q*(2) = 6` (The Jackpot!)
168 | * `q*(3) = 4` (Decent)
169 | * *The agent does not know these numbers.*
170 | * **Our Agent's Strategy:** ε-Greedy with `ε = 0.1` (10% chance to explore).
171 | * **Initial State:** The agent starts with no knowledge. Its estimated values (`Q`) and pull counts (`N`) for each arm are all zero.
172 | * `Q(1)=0, Q(2)=0, Q(3)=0`
173 | * `N(1)=0, N(2)=0, N(3)=0`
174 |
175 | #### The Incremental Update Formula
176 |
177 | As we get new rewards, we need to update our `Q` values efficiently. We won't re-calculate the average from scratch every time. Instead, we use a simple incremental formula.
178 |
179 | When we choose action `A` and get reward `R`:
180 | 1. First, increment the count for that action: `N(A) = N(A) + 1`
181 | 2. Then, update the value estimate with this formula:
182 |
183 | `Q(A) = Q(A) + (1/N(A)) * [R - Q(A)]`
184 |
185 | Let's break this down:
186 | * `[R - Q(A)]` is the **error**: the difference between the new reward and what we expected.
187 | * We take a "step" to correct this error, with the step size `1/N(A)`. Notice that as we sample an action more (`N(A)` gets bigger), the step size gets smaller. This means our estimates become more stable and less affected by single random rewards over time.
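Here is a minimal sketch of this update in Python, replaying the first two pulls of the walkthrough below (the function name is ours):

```python
def update_estimate(Q, N, A, R):
    """Incremental sample-average update after taking action A and observing reward R."""
    N[A] += 1                              # N(A) = N(A) + 1
    Q[A] += (1.0 / N[A]) * (R - Q[A])      # Q(A) = Q(A) + (1/N(A)) * [R - Q(A)]

Q, N = [0.0, 0.0, 0.0], [0, 0, 0]
update_estimate(Q, N, A=0, R=1)   # first pull of Arm 1 below: Q becomes [1.0, 0, 0]
update_estimate(Q, N, A=0, R=3)   # second pull of Arm 1 below: Q becomes [2.0, 0, 0]
print(Q, N)
```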
188 |
189 | #### The Game Begins
190 |
191 | Let's follow the agent for the first 7 pulls. We will track everything in a table.
192 |
193 | | Step (t) | Agent's Decision | Action (A_t) | Reward (R_t) | Agent's Updated Brain: N(a) and Q(a) |
194 | | :--- | :--- | :--- | :--- | :--- |
195 | | **Start** | --- | --- | --- | `N=(0,0,0)`, `Q=(0,0,0)` |
196 | | **1** | All Qs are 0, must pick randomly. | **Arm 1** | `R=1` | `N=(1,0,0)`, `Q=(1, 0, 0)` <br> _`Q(1) = 0 + 1/1 * (1-0) = 1`_ |
197 | | **2** | `argmax` is Arm 1. Roll is > 0.1 -> **EXPLOIT**. | **Arm 1** | `R=3` | `N=(2,0,0)`, `Q=(2, 0, 0)` <br> _`Q(1) = 1 + 1/2 * (3-1) = 2`_ |
198 | | **3** | `argmax` is Arm 1. Roll is < 0.1 -> **EXPLORE**. Picks randomly. | **Arm 3** | `R=5` | `N=(2,0,1)`, `Q=(2, 0, 5)` <br> _`Q(3) = 0 + 1/1 * (5-0) = 5`_ |
199 | | **4** | `argmax` is now Arm 3. Roll > 0.1 -> **EXPLOIT**. | **Arm 3** | `R=3` | `N=(2,0,2)`, `Q=(2, 0, 4)` <br> _`Q(3) = 5 + 1/2 * (3-5) = 4`_ |
200 | | **5** | `argmax` is still Arm 3. Roll > 0.1 -> **EXPLOIT**. | **Arm 3** | `R=6` | `N=(2,0,3)`, `Q=(2, 0, 4.67)` <br> _`Q(3) = 4 + 1/3 * (6-4) = 4.67`_ |
201 | | **6** | `argmax` is still Arm 3. Roll < 0.1 -> **EXPLORE**. Picks randomly. | **Arm 2** | `R=8` | `N=(2,1,3)`, `Q=(2, 8, 4.67)` <br> _`Q(2) = 0 + 1/1 * (8-0) = 8`_ |
202 | | **7** | **`argmax` is now Arm 2!** Roll > 0.1 -> **EXPLOIT**. | **Arm 2** | `R=5` | `N=(2,2,3)`, `Q=(2, 6.5, 4.67)` <br> _`Q(2) = 8 + 1/2 * (5-8) = 6.5`_ |
203 |
204 | #### Let's Analyze What Happened
205 |
206 | This short sequence shows the power of ε-Greedy in action:
207 | * **Initial Belief:** After two pulls, the agent thought Arm 1 was best (`Q(1)=2`). A purely greedy agent would have gotten stuck here.
208 | * **Discovery through Exploration:** In **Step 3**, a random exploratory action forced the agent to try Arm 3. It got a good reward (`R=5`), and its belief about the best arm changed.
209 | * **Another Discovery:** The agent was happily exploiting Arm 3 until **Step 6**, when another random exploration forced it to try the last unknown, Arm 2. It got a very high reward (`R=8`), and its belief changed again!
210 | * **Nearing the Truth:** After only 7 pulls, the agent's estimates are `Q=(2, 6.5, 4.67)`. These are getting much closer to the true values of `q*=(2, 6, 4)`. Its greedy choice is now correctly focused on the best arm, Arm 2.
211 |
212 | This is the learning process. The agent starts with no idea, forms a belief, and then uses exploration to challenge and refine that belief. Over thousands of steps, this simple mechanism allows it to zero in on the best actions in its environment.
213 |
214 | ### Part 5: Two More "Clever Tricks" for Exploration
215 |
216 | The ε-Greedy method is simple and effective, but its exploration is *random*. It doesn't care if it's exploring a machine it has tried 100 times or one it has never touched. Can we be smarter about how we explore? Yes.
217 |
218 | Here are two popular techniques that add a bit more intelligence to the exploration process.
219 |
220 | #### 1. Optimistic Initial Values: The Power of Positive Thinking
221 |
222 | This is a wonderfully simple trick that encourages a burst of exploration right at the start of learning.
223 |
224 | **The Idea:** Instead of initializing your value estimates `Q(a)` to 0, initialize them to a "wildly optimistic" high number. For example, if you know the maximum possible reward from any machine is 10, you might set all your initial `Q` values to 20.
225 |
226 | `Q_1(a) = 20` for all actions `a`.
227 |
228 | **How it Works:**
229 | 1. On the first step, all actions look equally amazing (`Q=20`). The agent picks one, let's say Arm 1.
230 | 2. It gets a real reward, say `R=5`.
231 | 3. The agent updates its estimate `Q(1)`. The new `Q(1)` will now be a value much lower than 20.
232 | 4. Now, the agent looks at its options again. Arm 1 looks "disappointing" compared to all the other arms, which it still believes have a value of 20.
233 | 5. So, for its next turn, the greedy agent will naturally pick a *different* arm.
234 |
235 | This process continues. Every time an arm is tried, its value drops from the optimistic high, making it look "disappointing" and encouraging the agent to try all the other arms it hasn't touched yet. It’s a self-correcting system that drives the agent to explore everything at least a few times before it starts to settle on the true best option.
236 |
237 | **Key Takeaway:** By being optimistic, a purely greedy agent is tricked into exploring.
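A minimal sketch of the trick: the agent below is purely greedy, and the only change is the optimistic initialization (the value 20 and the reward distributions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [1.0, 5.0, 10.0]      # hidden payouts (assumed for illustration)
Q = np.full(3, 20.0)               # wildly optimistic initial estimates
N = np.zeros(3)

def greedy_pull():
    """One purely greedy pull; optimism alone is what drives the early exploration."""
    a = int(np.argmax(Q))
    r = rng.normal(true_means[a], 1.0)     # noisy reward around the true mean
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]              # sample-average update
    return a

print([greedy_pull() for _ in range(6)])   # e.g. [0, 1, 2, 2, 2, 2]: every arm gets tried
```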
238 |
239 | #### 2. Upper-Confidence-Bound (UCB): The "Smart Exploration" Method
240 |
241 | UCB is a more sophisticated approach. It addresses a key question: if we're going to explore, which arm is the *most useful* one to try?
242 |
243 | **The Idea:** The best arm to explore is one that is both:
244 | * **Potentially high-value** (its current `Q(a)` is high).
245 | * **Highly uncertain** (we haven't tried it much, so `Q(a)` could be very wrong).
246 |
247 | UCB combines these two factors into a single score. Instead of just picking the `argmax` of `Q_t(a)`, it picks the `argmax` of a special formula:
248 |
249 | `A_t = argmax_a [ Q_t(a) + c * sqrt(ln(t) / N_t(a)) ]`
250 |
251 | Let's break that down without fear:
252 | * `Q_t(a)` is our standard value estimate. This is the **exploitation** part.
253 | * The second part, `c * sqrt(ln(t) / N_t(a))`, is the **exploration bonus** or **uncertainty term**.
254 | * `t` is the total number of pulls so far. As `t` increases, this term slowly grows, encouraging exploration over time.
255 | * `N_t(a)` is the number of times we've pulled arm `a`. This is the important part: **as `N_t(a)` increases, the uncertainty bonus shrinks.**
256 | * `c` is a constant that controls how much you favor exploration. A bigger `c` means more exploring.
257 |
258 | **How it Works:**
259 | * If an arm has a good `Q` value but has been tried many times (`N_t(a)` is large), its uncertainty bonus will be small. It's a known quantity.
260 | * If an arm has a mediocre `Q` value but has been tried only a few times (`N_t(a)` is small), its uncertainty bonus will be very large. This makes it an attractive candidate for exploration because its true value could be much higher than we think.
261 |
262 | UCB naturally balances exploration and exploitation. It favors arms it is uncertain about, and as it tries them, its uncertainty decreases, and the `Q` value starts to matter more. It's a more directed and often more efficient way to explore than the random approach of ε-Greedy.
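Here is a minimal sketch of UCB action selection (the function name is ours; the Q and N values reuse the Part 4 walkthrough, and untried arms are pulled first, a common convention since their bonus would otherwise divide by zero):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax of Q plus the uncertainty bonus; untried arms (N == 0) are pulled first."""
    N = np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])             # N=0 means maximal uncertainty: try it now
    bonus = c * np.sqrt(np.log(t) / N)     # shrinks as an arm gets sampled more
    return int(np.argmax(Q + bonus))

Q = np.array([2.0, 6.5, 4.67])    # estimates from the Part 4 walkthrough
N = np.array([2, 2, 3])
print(ucb_action(Q, N, t=8))      # arms with small N get a larger bonus added to Q
```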
263 |
264 | ### Part 6: So What? Why Bandits Matter
265 |
266 | We’ve journeyed through the k-armed bandit problem, starting with a simple casino analogy and exploring several strategies to solve it. So, what’s the big takeaway?
267 |
268 | #### Summary: The Heart of the Problem
269 |
270 | The multi-armed bandit problem is not really about slot machines. It is a simplified, pure version of the core challenge in all of reinforcement learning: the **exploration-exploitation dilemma**.
271 |
272 | We saw that simple strategies can have major flaws:
273 | * The **Greedy** method gets stuck, failing to find the best option because it never looks back.
274 |
275 | And we saw how to fix it by intelligently balancing the trade-off:
276 | * **ε-Greedy** is a simple and robust solution: it acts greedily most of the time but takes a random exploratory action with a small probability `ε`, ensuring it never gets stuck.
277 | * **Optimistic Initial Values** is a clever trick that uses a purely greedy agent but encourages a natural burst of exploration at the beginning by assuming everything is amazing.
278 | * **Upper-Confidence-Bound (UCB)** is a more sophisticated method that explores strategically, prioritizing actions that are both promising and highly uncertain.
279 |
280 | Each of these methods provides a way to gather information (explore) while trying to maximize rewards with the information you have (exploit).
281 |
282 | #### The Bridge to Real-World Reinforcement Learning
283 |
284 | This "bandit" framework is a fundamental building block. The same principles apply to much more complex problems, both in technology and in real life.
285 |
286 | * **A/B Testing on Websites:** Which version of a headline or button color (`actions`) will get the most clicks (`reward`)? A company can use bandit algorithms to automatically explore different versions and quickly exploit the one that works best, maximizing user engagement in real-time.
287 |
288 | * **Clinical Trials:** Doctors want to find the most effective treatment (`action`) for a disease. Each patient's outcome is a `reward`. Bandit algorithms can help balance giving patients the current best-known treatment (exploit) with trying new, experimental ones that might be even better (explore), potentially saving more lives in the long run.
289 |
290 | * **The Full RL Problem:** The problems we've discussed so far are "non-associative," meaning the best action is always the same. But what if the best action depends on the **situation** or **context**?
291 | * For a self-driving car, the best action (steer, brake) depends on the situation (red light, green light, pedestrian).
292 | * For a game-playing AI, the best move depends on the state of the board.
293 |
294 | This is called **associative search** or **contextual bandits**, and it's the next step towards the full reinforcement learning problem. The agent must learn not just the best action overall, but the best action *for each specific situation*. The methods we learned here—like ε-greedy and UCB—are used as the core decision-making components inside these more advanced AIs.
295 |
296 | By understanding the simple trade-off in the k-armed bandit problem, you have grasped the essential challenge that every reinforcement learning agent must solve to learn and act intelligently in a complex world.
--------------------------------------------------------------------------------
/docs/rl/n-step-bootstrapping.md:
--------------------------------------------------------------------------------
1 | ## n-Step Bootstrapping: Finding the Sweet Spot Between TD and Monte Carlo
2 |
3 | ### Part 1: The Spectrum of Learning - Beyond the Two Extremes
4 |
5 | In our journey so far, we've met two very different kinds of model-free learners, each with its own distinct philosophy.
6 |
7 | 1. **The Monte Carlo Learner (The Super-Patient Learner):** This agent is the ultimate purist. It completes an entire episode, from start to finish, and only then calculates the *true* final reward (`G_t`). It updates its value estimates based on this complete, real outcome. This approach is **unbiased** because it uses the real, final score, but it suffers from **high variance**. A single lucky break or a fluke mistake at the end of a long episode can dramatically swing the final score, sending a very noisy learning signal to all the states and actions that came before it.
8 |
9 | 2. **The One-Step TD Learner (The Myopic, Impatient Learner):** This agent is the opposite. It takes a single step, collects one reward (`R_{t+1}`), and immediately updates its previous estimate using its *new guess* about the future (`V(S_{t+1})`). This is bootstrapping. This approach has **low variance** because its updates are small and based on a stable, learned estimate. However, it can be **biased**, because it's learning from a guess that might be wildly inaccurate, especially early in training.
10 |
11 | This presents us with a classic trade-off. Do we want an accurate but noisy signal (MC), or a stable but potentially flawed one (TD)?
12 |
13 | This begs the question: **Must we choose between these two extremes?** Is there no middle ground between waiting for the absolute end and only looking one single step ahead?
14 |
15 | #### The Weather Forecaster Analogy
16 |
17 | To build our intuition, let's imagine a weather forecaster trying to improve their models. Their goal is to predict the average weather for an entire week, starting from Monday.
18 |
19 | * **The Monte Carlo Forecaster:** On Monday, they make a prediction for the whole week. Then, they wait. On the following Sunday evening, after all seven days of weather have actually happened, they look at the complete, real data and use it to update their prediction model for what "a week starting like this Monday" looks like. This is accurate but slow, and a freak heatwave on Saturday could unfairly make them think their Monday forecast was terrible.
20 |
21 | * **The 1-Step TD Forecaster:** On Monday, they make a prediction for the week. On Tuesday morning, they look at what *actually* happened on Monday. They combine this single day of real data with their *new guess* for the rest of the week (Tuesday through Sunday) to immediately update their original Monday model. This is fast, but it relies heavily on another forecast, which is still just a guess.
22 |
23 | Now, consider a more sensible approach.
24 |
25 | * **The n-Step Forecaster:** On Monday, they make their prediction. They wait a few days. On Thursday morning, they look at the *actual, real weather* for Monday, Tuesday, and Wednesday. This is a solid chunk of real data. They then combine this 3-day reality with their *new, updated forecast* for the rest of the week (Thursday through Sunday). They use this mix of reality and estimation to update their original Monday model.
26 |
27 | This third approach feels more robust. It uses more real information than the 1-step method but doesn't have to wait for the entire episode to finish like the Monte Carlo method. It strikes a balance.
28 |
29 | This is precisely the idea behind **n-step bootstrapping**. It generalizes both Monte Carlo and TD methods, turning them into two endpoints on a single spectrum. By choosing a value for `n`, we can create a "slider" or a "dial" that lets us choose exactly how many real steps of experience we want to use before we bootstrap from one of our own estimates. This gives us a powerful tool to control the bias-variance trade-off and, in many cases, learn much faster than either extreme.
30 |
31 | ### Part 2: The n-Step Return - Mixing Reality with Expectation
32 |
33 | Now that we have the intuition, let's formalize how this "slider" works. The core of any TD method is its **update target**—the value it nudges its current estimate towards. Let's see how this target changes as we adjust `n`.
34 |
35 | First, let's recall the targets we already know. We'll use `G` to represent the target return from a state `S_t`.
36 |
37 | * For **Monte Carlo**, the target is the full, actual return. It's the sum of all future discounted rewards until the episode terminates at time `T`:
38 | `G_t = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + ... + γ^(T-t-1)R_T`
39 | This is 100% reality.
40 |
41 | * For **1-step TD**, the target is the first reward plus the *estimated* value of the next state:
42 | `G_{t:t+1} = R_{t+1} + γV(S_{t+1})`
43 | This is one step of reality (`R_{t+1}`) followed by 100% estimation (`V(S_{t+1})`).
44 |
45 | The pattern is clear. The **n-step return** simply extends this pattern. We take `n` steps of real, observed rewards and then add the discounted *estimated* value of the state we land in after `n` steps.
46 |
47 | The **n-step target** (`G_{t:t+n}`) is defined as:
48 |
49 | `G_{t:t+n} = R_{t+1} + γR_{t+2} + ... + γ^(n-1)R_{t+n} + γ^n V(S_{t+n})`
50 |
51 | This elegant formula defines the entire spectrum of algorithms:
52 | * If `n=1`, we get the 1-step TD target.
53 | * If `n` is very large (specifically, `n ≥ T-t`), the sum includes all the rewards until the end, and the `V(S_{t+n})` term becomes zero because the episode is over. This gives us the Monte Carlo target.
54 | * If `n` is something in between, like 2 or 5 or 10, we get a hybrid method.
55 |
56 | This spectrum can be visualized with backup diagrams. The shaded circles are the estimates (the "guesses") we use to bootstrap.
57 |
58 | ```mermaid
59 | graph TD
60 | subgraph "1-Step TD"
61 | S_t1("S_t") -->|"R_t+1"| S_t2("S_t+1")
62 | style S_t2 fill:#e6ffed,stroke:#333,stroke-width:2px
63 | end
64 | subgraph "2-Step TD"
65 | S_t3("S_t") -->|"R_t+1"| S_t4("S_t+1") -->|"R_t+2"| S_t5("S_t+2")
66 | style S_t5 fill:#e6ffed,stroke:#333,stroke-width:2px
67 | end
68 | subgraph "n-Step TD"
69 | S_t6("S_t") -->|"R_t+1"| S_t7("...") -->|"R_t+n"| S_t8("S_t+n")
70 | style S_t8 fill:#e6ffed,stroke:#333,stroke-width:2px
71 | end
72 | subgraph "Monte Carlo (∞-Step)"
73 | S_t9("S_t") -->|"R_t+1"| S_t10("...") -->|"R_T"| S_t11("Terminal")
74 | style S_t11 fill:#e6ffed,stroke:#333,stroke-width:2px
75 | end
76 | ```
77 |
78 | The update rule for our state value `V(S_t)` then simply uses this new n-step target:
79 |
80 | `V(S_t) ← V(S_t) + α * [G_{t:t+n} - V(S_t)]`
81 |
82 | #### A Step-by-Step Example: The Advantage of a Longer View
83 |
84 | Let's revisit our simple random walk environment to see why a longer lookahead can be so much more efficient.
85 |
86 | ```
87 | [Term 0] <--> A <--> B <--> C <--> D <--> E <--> [Term +1]
88 | ```
89 |
90 | * **Rules:** The agent moves Left or Right randomly.
91 | * **Rewards:** `+1` for entering the right terminal state, `0` for all other transitions.
92 | * **Setup:** We'll set the discount factor `γ=1` and the learning rate `α=0.1`. For simplicity, we'll initialize the value of all non-terminal states to a neutral `V(s) = 0.5`.
93 |
94 | **An Episode Occurs:**
95 | Imagine the agent starts in state C and has a lucky walk straight to the goal: `C → D → E → Term(+1)`.
96 |
97 | Let's focus on the very first state, `C`. After this episode, how do we update our estimate for `V(C)`? We will look back at the experience and see what different values of `n` would tell us.
98 |
99 | The trajectory from state `C` was:
100 | * `S_t = C`
101 | * `R_{t+1} = 0`, `S_{t+1} = D`
102 | * `R_{t+2} = 0`, `S_{t+2} = E`
103 | * `R_{t+3} = +1`, `S_{t+3} = Term(+1)` (Value is always 0)
104 |
105 | Let's calculate the update target for `V(C)` using different `n`:
106 |
107 | * **1-Step Return (`n=1`)**
108 | * Target `G_{t:t+1} = R_{t+1} + γV(S_{t+1}) = 0 + 1 * V(D) = 0.5`.
109 | * The update would be: `V(C) ← 0.5 + 0.1 * (0.5 - 0.5) = 0.5`.
110 | * **Result:** No change. The agent learned nothing about `C` because its immediate neighbor `D` still has the initial, uninformative value of 0.5.
111 |
112 | * **2-Step Return (`n=2`)**
113 | * Target `G_{t:t+2} = R_{t+1} + γR_{t+2} + γ^2V(S_{t+2}) = 0 + 1*0 + 1^2*V(E) = 0.5`.
114 | * The update would be: `V(C) ← 0.5 + 0.1 * (0.5 - 0.5) = 0.5`.
115 | * **Result:** Still no change! Even looking two steps ahead only gets us to state `E`, which also has the uninformative initial value.
116 |
117 | * **3-Step Return (`n=3`)**
118 | * Target `G_{t:t+3} = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + γ^3V(S_{t+3})`
119 | * Target = `0 + 1*0 + 1^2*(+1) + 1^3*V(Term) = 1 + 0 = 1`.
120 | * The update would be: `V(C) ← 0.5 + 0.1 * (1 - 0.5) = 0.55`.
121 | * **Result:** Success! The 3-step return was long enough to "see" the actual `+1` reward at the end of the episode. It could use this piece of *real* information to immediately update its estimate for `V(C)`, correctly identifying that `C` is more valuable than its initial guess.
122 |
123 | This is the power of n-step methods. With 1-step TD, the reward information from the goal would have to slowly "trickle back" one state at a time over many, many episodes. But by using a larger `n`, we can propagate that credit much more rapidly, leading to significantly faster learning.
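Here is a minimal sketch that reproduces the three targets above (the helper name is ours; the rewards, state values, and learning rate come from the example):

```python
def n_step_target(rewards, next_values, n, gamma=1.0):
    """n real rewards, then bootstrap from the value of the state reached after n steps.
    rewards[i] is R_{t+1+i}; next_values[i] is V(S_{t+1+i})."""
    n = min(n, len(rewards))                          # cap n at the end of the episode
    G = sum(gamma**i * rewards[i] for i in range(n))
    return G + gamma**n * next_values[n - 1]

rewards = [0, 0, 1]              # R_{t+1}, R_{t+2}, R_{t+3} from the C -> D -> E -> goal walk
next_values = [0.5, 0.5, 0.0]    # V(D), V(E), V(terminal)
for n in (1, 2, 3):
    G = n_step_target(rewards, next_values, n)
    print(f"n={n}: target={G}, updated V(C)={0.5 + 0.1 * (G - 0.5):.2f}")
```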
124 |
125 | ### Part 3: n-Step Control - Learning to Act with a Longer View
126 |
127 | So far, we've developed a powerful method for *prediction*—estimating the value of states (`V(s)`) with a flexible lookahead. But as always, prediction is only half the battle. Our ultimate goal is *control*: teaching an agent what actions to take to maximize its reward.
128 |
129 | To do this, we need to shift our focus from state-values (`V(s)`) to action-values (`Q(s, a)`). This tells the agent how good it is to take a specific *action* from a specific *state*.
130 |
131 | The good news is that we can apply our n-step idea directly to our on-policy control algorithm, SARSA. This gives us **n-step SARSA**. The logic is a straightforward extension of what we've just learned.
132 |
133 | We simply redefine our n-step return in terms of Q-values. Instead of bootstrapping from the value of the state `n` steps ahead, `V(S_{t+n})`, we bootstrap from the value of the *state-action pair* `n` steps ahead, `Q(S_{t+n}, A_{t+n})`.
134 |
135 | The **n-step SARSA target** (`G_{t:t+n}`) is:
136 |
137 | `G_{t:t+n} = R_{t+1} + γR_{t+2} + ... + γ^(n-1)R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})`
138 |
139 | The update rule for the action-value `Q(S_t, A_t)` follows naturally:
140 |
141 | `Q(S_t, A_t) ← Q(S_t, A_t) + α * [G_{t:t+n} - Q(S_t, A_t)]`
142 |
143 | Notice that the target depends on `A_{t+n}`—the action the agent actually chooses `n` steps into the future according to its policy. This is why n-step SARSA is still a classic **on-policy** method. Our "cautious realist" agent is still learning the value of its own exploratory policy, but now it's doing so with a longer, more informative chunk of real-world experience. It's less myopic, but just as realistic.
144 |
145 | #### The Power of Rapid Credit Assignment
146 |
147 | The real magic of n-step control comes from its ability to propagate information quickly. Let's make this concrete. Imagine all `Q` values start at 0. The goal `G` gives a large reward. In its first episode, an agent happens to find the goal after a 10-step journey.
148 |
149 | Let's represent this path abstractly:
150 |
151 | `S₀ --(A₀)--> S₁ --(A₁)--> S₂ ... S₉ --(A₉)--> S₁₀ (Goal!)`
152 |
153 | Here, the agent starts in state `S₀`, takes action `A₀`, gets to `S₁`, and so on, for 10 steps, finally reaching the goal at `S₁₀`.
154 |
155 | What does each algorithm learn from this single, successful episode?
156 |
157 | **With 1-Step SARSA:**
158 | The algorithm updates state-action pairs one step at a time. After taking action `A₈` from state `S₈` to get to `S₉`, it then takes `A₉` and reaches the goal `S₁₀`. At this point, it uses the reward to update `Q(S₉, A₉)`. That's it. To update `Q(S₈, A₈)`, it would need another episode where it visits `S₈` and then moves to an `S₉` that *already has an improved value*. The information from the goal has to slowly "trickle back", one state at a time, over many, many episodes.
159 |
160 | **With 10-Step SARSA:**
161 | The algorithm's memory is much longer. Let's consider the update for the very first action, `A₀`, taken from the start state `S₀`. The 10-step algorithm collects the full sequence of 10 rewards (`R₁` through `R₁₀`) and then bootstraps from the Q-value of the state-action pair it will be in at the end, `Q(S₁₀, A₁₀)`. Because `S₁₀` is the goal, this value is directly influenced by the big reward.
162 |
163 | This 10-step target is then used to update `Q(S₀, A₀)`. In a single stroke, the good outcome at the end of the path has directly increased the value of the very first action taken. The same logic applies to `Q(S₁, A₁)`, which uses a 9-step return, and so on. The entire path gets credit simultaneously.
164 |
165 | Instead of a slow trickle of information, n-step learning is like a flash of lightning that illuminates the entire successful path at once. This ability to assign credit to a long chain of actions allows learning to spread through the state space much more efficiently, often resulting in a dramatic speedup in finding a good policy.
166 |
167 | ### Part 4: The Off-Policy Conundrum: Importance Sampling vs. Trees
168 |
169 | We've successfully created a more powerful on-policy learner, n-step SARSA. It's faster and more efficient. But what about our off-policy friend, Q-learning? Can we create an n-step version of an off-policy algorithm?
170 |
171 | This brings us to a significant challenge. The core problem is that in off-policy learning, the agent follows one policy (the exploratory *behavior* policy, `b`) while trying to learn about another (the optimal *target* policy, `π`).
172 |
173 | Our first instinct might be to do what Q-learning does: take `n` real steps according to our behavior policy `b`, and then use the `max` operation at the end to stand in for the optimal policy `π`.
174 |
175 | But this is not enough. The `max` operation only accounts for the decision at the *end* of the n-step sequence. It doesn't account for the `n-1` actions *in the middle* of the path, which were chosen by our exploratory policy `b`, not the optimal policy `π` we want to learn about. The path itself is "off-policy."
176 |
177 | To solve this, we need a way to learn from an n-step trajectory generated by one policy while evaluating it according to another. There are two main approaches to this problem.
178 |
179 | #### Approach 1: The Correction Factor (Importance Sampling)
180 |
181 | The first approach is to ask: "How likely was this n-step journey to have happened under my target policy?" We can then use this likelihood to "correct" our update. This is done using **importance sampling**.
182 |
183 | The idea is to calculate a weight for the entire n-step sequence. This weight, called the **importance sampling ratio (`ρ`)**, is the product of the relative probabilities of taking each action along the path.
184 |
185 | `ρ_{t:t+n-1} = Π [ π(A_k | S_k) / b(A_k | S_k) ]` from `k=t` to `t+n-1`
186 |
187 | Let's break this down:
188 | * `π(A_k | S_k)` is the probability that our **target** policy (e.g., greedy) would have taken action `A_k`.
189 | * `b(A_k | S_k)` is the probability that our **behavior** policy (e.g., ε-greedy) actually took action `A_k`.
190 |
191 | We calculate this ratio for every step in our n-step window and multiply them all together.
192 | * If at any point our agent took an exploratory action that the greedy target policy would *never* take (`π(A_k | S_k) = 0`), the entire ratio `ρ` becomes 0. The experience is discarded, as it's irrelevant to the optimal policy.
193 | * We then simply multiply our n-step return by this ratio `ρ` before performing the update.
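Here is a minimal sketch of computing ρ over an n-step window, assuming a greedy target policy and an ε-greedy behavior policy built from the same Q-table (the function name and the toy Q values are illustrative):

```python
import numpy as np

def importance_ratio(states, actions, Q, epsilon=0.1):
    """Product of pi(A|S) / b(A|S) over an n-step window, for a greedy target policy
    and an epsilon-greedy behavior policy sharing the same Q estimates."""
    k = Q.shape[1]                         # number of actions
    rho = 1.0
    for s, a in zip(states, actions):
        greedy = int(np.argmax(Q[s]))
        pi = 1.0 if a == greedy else 0.0                                   # greedy target
        b = (1 - epsilon) + epsilon / k if a == greedy else epsilon / k    # eps-greedy behavior
        rho *= pi / b
        if rho == 0.0:                     # the target policy would never take this action
            break
    return rho

Q = np.array([[1.0, 5.0],      # toy Q-table: 2 states x 2 actions
              [2.0, 0.5]])
print(importance_ratio(states=[0, 1], actions=[1, 0], Q=Q))  # all-greedy path -> rho > 1
print(importance_ratio(states=[0, 1], actions=[0, 0], Q=Q))  # one exploratory step -> rho = 0
```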
194 |
195 | **The Downside:** While correct in theory, this approach can be very unstable. If the behavior policy takes a rare exploratory action (a very small `b(A|S)`), the ratio can explode, leading to updates with extremely **high variance**. A single, unlikely event can cause a massive, destabilizing swing in the Q-values, often slowing learning down to a crawl.
196 |
197 | #### Approach 2: The Elegant Alternative (Tree Backup)
198 |
199 | If correcting for the "wrong" path is so volatile, what if we could avoid the problem altogether? This is the beautiful idea behind the **Tree-Backup Algorithm**.
200 |
201 | Instead of sampling an entire n-step path and then correcting it, the Tree-Backup algorithm mixes real samples with estimated expectations at *every single step*.
202 |
203 | The best way to understand this is with its backup diagram.
204 |
205 | ```mermaid
206 | graph TD
207 | subgraph "Update Target Construction"
208 | S_t("S_t") --"Sampled Action A_t"--> S_t1("S_t+1")
209 | S_t1 -->|"π(a'|S_t+1) * Q"| other_a1("a' ≠ A_t+1")
210 | S_t1 --"Sampled Action A_t+1"--> S_t2("S_t+2")
211 | S_t2 -->|"π(a''|S_t+2) * Q"| other_a2("a'' ≠ A_t+2")
212 | S_t2 --"..."--> S_tn("S_t+n")
213 | S_tn -->|"π(a'''|S_t+n) * Q"| all_an("all a'''")
214 | end
215 | style S_t fill:#e6ffed,stroke:#333,stroke-width:2px
216 | style other_a1 fill:#fffbe6,stroke:#333,stroke-width:1px,stroke-dasharray: 3 3
217 | style other_a2 fill:#fffbe6,stroke:#333,stroke-width:1px,stroke-dasharray: 3 3
218 | style all_an fill:#fffbe6,stroke:#333,stroke-width:1px,stroke-dasharray: 3 3
219 | ```
220 | *The diagram shows a 3-step example. The solid path is the "spine" of actions that were actually taken. The dashed "ribs" are the expected values for all the other actions that were not taken.*
221 |
222 | Here is how it works:
223 | 1. We start at `S_t` and take action `A_t`. This is our sample.
224 | 2. We land in `S_{t+1}`. To form our update target, we don't just follow the next sampled action `A_{t+1}`. Instead, we do two things:
225 | * We follow the branch for the action we **actually took**, `A_{t+1}`.
226 | * For **all other actions** `a'` that we *didn't* take, we add their estimated values (`Q(S_{t+1}, a')`) to the target, weighted by the probability that the *target policy* `π` would have taken them.
227 | 3. We repeat this process for `n` steps. At each state, we follow the single branch of what really happened, and we mix in the estimated values of all the other things that *could have* happened.
228 |
229 | The final update target is a "tree" of possibilities, branching at every state. It is composed of a single path of real rewards from sampled actions, but it's constantly being blended with the expected values of the actions we didn't take.
230 |
231 | **The Takeaway:** The Tree-Backup algorithm is a full n-step off-policy method that **completely avoids importance sampling ratios**. By mixing samples and expectations, it constructs a low-variance target. This makes it much more stable and often more data-efficient than methods that rely on correction factors. It's a different, and often better, way to solve the off-policy learning puzzle.
232 |
233 | ### Part 5: The Road Ahead - A Unified View and What's Next
234 |
235 | In this chapter, we have bridged the gap between the two extremes of model-free learning. By introducing the concept of the **n-step return**, we've shown that Monte Carlo and 1-step TD are not fundamentally different algorithms, but are instead two points on a continuous spectrum.
236 |
237 | Let's recap our journey:
238 | * We started by identifying the **bias-variance trade-off**. MC has high variance and no bias; 1-step TD has low variance but is biased by its own estimates.
239 | * We introduced the **n-step return** as a "slider" to control this trade-off, mixing `n` steps of real rewards with a single bootstrapped estimate.
240 | * We extended this idea to control with **n-step SARSA**, showing how a longer lookahead can dramatically speed up credit assignment and policy learning.
241 | * Finally, we tackled the difficult problem of **n-step off-policy learning**. We contrasted two powerful approaches: weighting trajectories with **Importance Sampling**, and building a mixed target with the **Tree-Backup** algorithm.
242 |
243 | This exploration has equipped us with a much more flexible and powerful toolkit. In many practical problems, the best performance isn't found at the extremes of `n=1` or `n=∞`, but somewhere in the middle.
244 |
245 | But as we stand at this new peak, a larger challenge comes into view. It's a limitation that has been the silent partner to *all* of our methods so far—from Monte Carlo to TD to n-step.
246 |
247 | #### The Elephant in the Room: The Lookup Table
248 |
249 | Every algorithm we have built shares a fundamental assumption: that we can store our `V(s)` or `Q(s, a)` values in a big table, with one entry for every single state or state-action pair.
250 |
251 | This is fine for small worlds like grid puzzles or tic-tac-toe. But what about real problems?
252 | * **Chess:** Has an estimated `10^47` states.
253 | * **Go:** Has more states than atoms in the known universe (`>10^170`).
254 | * **Robotics:** A robot arm with continuous joint angles has an *infinite* number of states.
255 | * **Video Games:** Controlling an agent from pixel data means each screen is a state. The number of possible screens is astronomical.
256 |
257 | This is the **scaling problem**. The lookup table approach, the very foundation of our methods, is completely non-viable for almost any problem of real-world interest. This is the Achilles' heel of the tabular approach.
258 |
259 | This forces us to ask the most important question on our journey so far: **How can an agent make good decisions in states it has never seen before?**
260 |
261 | The answer is that it must **generalize**. It needs to learn a compact representation of value that covers unseen states based on their similarity to states it has experienced. To do this, we must throw away the lookup table and replace it with something far more powerful: a **function approximator**.
262 |
263 | Instead of storing a value for `V(S_t)`, we will learn the *parameters* (or weights **w**) of a function `v̂(s, w)` that can *estimate* that value. This could be a simple linear function or a complex neural network.
264 |
265 | This is the next great leap in our journey—the move from tabular methods to approximate solution methods. It is the bridge from solving toy problems to tackling the immense complexity of the real world.
--------------------------------------------------------------------------------
/docs/llm/attention.md:
--------------------------------------------------------------------------------
1 | # **Give Me 20 Minutes, I Will Make Attention Click Forever**
2 |
3 | ## Introduction: The Core Problem & Our Blueprint
4 |
5 | The word "**bank**" in "I sat on the river **bank**" is completely different from the "**bank**" in "I withdrew money from the **bank**." For a machine, this is a huge problem. How can the representation of a word change based on its neighbors?
6 |
7 | The answer is the **Attention Mechanism**. In the next 20 minutes, you will understand exactly how it works. You will learn:
8 |
9 | * **Word Embeddings:** The static starting point for every word.
10 | * **Scaled Dot-Product Attention:** The core formula that enables context.
11 | * **Query, Key, and Value (QKV):** The three roles a word can play.
12 | * **The Causal Mask:** How to prevent the model from cheating by looking ahead.
13 | * **Multi-Head Attention:** How to scale the mechanism for powerful models.
14 |
15 | This is the entire secret. This single formula and the Python code that implements it are the engine behind models like ChatGPT.
16 |
17 | **The Formula:**
18 | $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V $$
19 |
20 | **The Code:**
21 | ```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

22 | class CausalSelfAttention(nn.Module):
23 | def __init__(self, config):
24 | super().__init__()
25 | assert config.n_embd % config.n_head == 0
26 | self.n_head, self.n_embd = config.n_head, config.n_embd
27 | self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
28 | self.c_proj = nn.Linear(config.n_embd, config.n_embd)
29 | self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))
30 |
31 | def forward(self, x):
32 | B, T, C = x.size()
33 | qkv = self.c_attn(x)
34 | q, k, v = qkv.split(self.n_embd, dim=2)
35 | head_dim = C // self.n_head
36 | q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
37 | k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
38 | v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
39 | att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(head_dim))
40 | att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
41 | att = F.softmax(att, dim=-1)
42 | y = att @ v
43 | y = y.transpose(1, 2).contiguous().view(B, T, C)
44 | return self.c_proj(y)
45 | ```
46 | This might look intimidating, but we will build it from the ground up until every line is obvious. But before we can build this engine, we must first understand the fuel it runs on: word vectors.
47 |
48 | ---
49 |
50 | ### **Chapter 1: The Starting Point & Its Fatal Flaw (Word Embeddings)**
51 |
52 | A neural network cannot understand the word "cat". It can only understand lists of numbers, called **vectors**. Our first job is to convert every word in our vocabulary into a unique vector.
53 |
54 | The mechanism for this is a simple lookup table called an **Embedding Layer**.
55 |
56 | **The Mechanism: A Learnable Dictionary**
57 | Imagine a giant spreadsheet with one row for every word in the vocabulary. Each row contains the vector for that word. The `nn.Embedding` layer is exactly this.
58 |
59 | ```python
60 | import torch
61 | import torch.nn as nn
62 |
63 | # A tiny config for our example
64 | vocab_size = 10 # Our dictionary has 10 words
65 | n_embd = 4 # Each word will be represented by a vector of size 4
66 |
67 | # The layer is our coordinate book
68 | token_embedding_table = nn.Embedding(vocab_size, n_embd)
69 |
70 | # Let's look up the vector for the word with ID=3
71 | input_id = torch.tensor([3])
72 | vector = token_embedding_table(input_id)
73 |
74 | print(f"The vector for word ID {input_id.item()} is:\n{vector}")
75 | ```
76 | **Output:**
77 | ```
78 | The vector for word ID 3 is:
79 | tensor([[-1.5323, -0.2343,  0.5132, -1.0833]], grad_fn=<EmbeddingBackward0>)
80 | ```
81 | Initially, these vectors are random. During training, the model learns the optimal vector for each word.
82 |
83 | **The Fatal Flaw: No Context**
84 |
85 | This simple lookup has one massive problem: it is **static**. The vector for a word is the same regardless of the words around it.
86 |
87 | Let's return to our "bank" example.
88 | * Sentence 1: "I sat on the river **bank**."
89 | * Sentence 2: "I withdrew money from the **bank**."
90 |
91 | Let's assume the word "bank" has ID `7` in our vocabulary. When we look up its vector, the process is identical for both sentences.
92 |
93 | | Context | Word | Lookup Process | Resulting Vector |
94 | | :--- | :--- | :--- | :--- |
95 | | "river..." | bank | `embedding_table[7]` | `[0.1, 0.8, -0.4, ...]` |
96 | | "money..." | bank | `embedding_table[7]` | `[0.1, 0.8, -0.4, ...]` **(Identical!)** |
97 |
98 | This is the core limitation. Our initial vectors are context-free. They represent a word's general meaning but are blind to the specific meaning in a sentence.
99 |
100 | This sets up our central question: **How can we dynamically modify a word's vector using the context from its neighbors?**
101 |
102 | The answer is the Attention mechanism, which we will build, step-by-step, in the next chapter.
103 |
104 | ## **Chapter 2: The Core Idea - A "Conversation" Between Words**
105 |
106 | How do we fix the fatal flaw of static embeddings? The vector for "bank" must become more "river-like" or more "money-like" depending on its context.
107 |
108 | The solution is to let the words in a sentence communicate with each other. The **Attention** mechanism allows each word to look at its neighbors and create a new, context-aware vector for itself.
109 |
110 | **The Analogy: A "Conversation" to Resolve Ambiguity**
111 |
112 | Think of this process as a three-step conversation for each word. Let's use the sentence "**Crane** lifted steel." The initial vector for "crane" is ambiguous (bird or machine?). To clarify itself, "crane" will:
113 |
114 | 1. **Ask a Question:** It will formulate a query about itself.
115 | 2. **Find Relevant Neighbors:** It will compare its query to labels provided by every other word.
116 | 3. **Absorb Information:** It will take a weighted average of information from its neighbors, listening more to the most relevant ones.
117 |
118 | To make this possible, each word's initial vector is used to derive three new vectors: a **Query**, a **Key**, and a **Value**.
119 |
120 | | Vector | Role | Analogy |
121 | | :--- | :--- | :--- |
122 | | **Query (Q)** | What I'm looking for. | The word's "search query" or "question." |
123 | | **Key (K)** | What I have. | The word's "label" or "keyword." This is what queries are matched against. |
124 | | **Value (V)**| What I'll give you. | The actual "information" or "substance" the word provides if a match is found. |
125 |
126 | **A Concrete Walkthrough: "Crane lifted steel"**
127 |
128 | Let's imagine a 2D space where Dimension 1 is "Animal-ness" and Dimension 2 is "Machine-ness".
129 |
130 | The ambiguous "crane" starts with a balanced vector. To resolve this, it uses its **Query** to probe the **Keys** of all words in the sentence (including itself).
131 |
132 | ```
133 | Image description:
134 | Three dots on a 2D plane labeled "Animal-ness" (x-axis) and "Machine-ness" (y-axis).
135 | - A dot for "crane" is at (0.7, 0.7), representing ambiguity.
136 | - A dot for "lifted" is at (0.1, 0.9), highly machine-like.
137 | - A dot for "steel" is at (0.1, 0.9), highly machine-like.
138 | An arrow originates from the "crane" dot, pointing towards the other dots, labeled "Query".
139 | Each dot has a label next to it, "Key".
140 | ```
141 |
142 | **Step 1: Scoring (Query probes Key)**
143 | The "crane" query, which is looking for context, finds a strong match with the "machine-like" keys of "lifted" and "steel". The mathematical operation for this "matching" is the **dot product**. A high dot product means high similarity.
144 |
145 | * `Score(crane -> lifted)`: HIGH (The machine-like parts align)
146 | * `Score(crane -> steel)`: HIGH (The machine-like parts align)
147 | * `Score(crane -> crane)`: Medium (It aligns with its own ambiguous self)
148 |
149 | **Step 2: Normalizing (Deciding who to listen to)**
150 | The raw scores are converted into percentages that sum to 1. This is done with the **softmax** function.
151 |
152 | * Attention for "crane" might become: { `crane`: 20%, `lifted`: 40%, `steel`: 40% }
153 | * This means "crane" has decided to construct its new self by listening mostly to "lifted" and "steel".
154 |
155 | **Step 3: Aggregating (Absorbing the information)**
156 | The new vector for "crane" is a weighted average of all the **Value** vectors in the sentence.
157 |
158 | * `New_Vector(crane) = 0.2 * V(crane) + 0.4 * V(lifted) + 0.4 * V(steel)`
159 |
160 | Since the Value vectors for "lifted" and "steel" are heavily "machine-like," they pull the new "crane" vector in that direction.
161 |
162 | ```
163 | Image description:
164 | The same 2D plane as before.
165 | The original "crane" dot is still at (0.7, 0.7) but is now faded.
166 | Two arrows, one from "lifted" and one from "steel", point to a new location on the plane.
167 | A new, solid dot for "crane" now appears at approximately (0.3, 0.8), much higher on the "Machine-ness" axis.
168 | This new dot is labeled "Updated Crane Vector".
169 | ```
170 |
171 | The result: The original, ambiguous "crane" vector has been transformed into a new, context-aware vector that is unambiguously "machine-like". The conversation worked.
172 |
173 | This three-step process—**Score, Normalize, Aggregate**—is the heart of the Attention mechanism. In the next chapter, we will translate this exact logic into efficient matrix operations.
174 |
175 | ## **Chapter 3: The Engine - Dot-Product Attention in Code**
176 |
177 | We've built the intuition: **Score, Normalize, Aggregate**. Now, let's translate this "conversation" into efficient matrix mathematics. By performing these three steps on matrices, we can process every word in a sentence simultaneously.
178 |
179 | Our map for this chapter is the core of the attention formula:
180 | $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
181 |
182 | We will build this with raw tensors to see every number. We'll use a tiny example sentence with 3 tokens (`T=3`), each represented by a 2-dimensional vector (`C=2`).
183 |
184 | **The Input (`x`): Our Static Embeddings**
185 | This is the tensor from our embedding layer. Shape `(B, T, C)` is `(1, 3, 2)`.
186 |
187 | ```python
188 | import torch
189 | import torch.nn.functional as F
190 | import math
191 |
192 | # Our input: Batch=1, Tokens=3, Channels=2
193 | x = torch.tensor([[[1.0, 0.2], # Vector for Token 1
194 | [0.8, 0.5], # Vector for Token 2
195 | [0.1, 0.9]]]) # Vector for Token 3
196 | ```
197 |
198 | **Step 1: Get Q, K, and V**
199 | In a real model, Q, K, and V are produced by passing `x` through three separate, learnable `nn.Linear` layers. This allows the model to learn the best "query", "key", and "value" representation for each word. For this tutorial, we will simplify and set them all equal to `x`.
200 |
201 | ```python
202 | q, k, v = x, x, x
203 | ```
204 |
205 | **Step 2: Score (`QK^T`)**
206 | This is the heart of the "conversation." To compute the similarity score of every token's query with every other token's key, we use a single matrix multiplication.
207 | * `q` has shape `(1, 3, 2)`.
208 | * We transpose `k` to `k.transpose(-2, -1)`, giving it a shape of `(1, 2, 3)`.
209 | * The multiplication `(1, 3, 2) @ (1, 2, 3)` results in a `(1, 3, 3)` matrix of scores.
210 |
211 | ```python
212 | scores = q @ k.transpose(-2, -1)
213 | print("--- Raw Scores (Attention Matrix) ---")
214 | print(scores.shape)
215 | print(scores.data.round(decimals=2))
216 | ```
217 | **Output:**
218 | ```
219 | --- Raw Scores (Attention Matrix) ---
220 | torch.Size([1, 3, 3])
221 | tensor([[[1.04, 0.90, 0.28], # Token 1's scores for (T1, T2, T3)
222 | [0.90, 0.89, 0.53], # Token 2's scores for (T1, T2, T3)
223 | [0.28, 0.53, 0.82]]]) # Token 3's scores for (T1, T2, T3)
224 | ```
225 | This `(3,3)` matrix holds the raw compatibility scores. For example, the query for Token 1 (row 0) has the highest compatibility with the key for Token 1 (column 0), which is `1.04`.
226 |
227 | **Step 3: Scale**
228 | This is the ` / sqrt(d_k)` part of the formula. We divide the scores by the square root of the key dimension (`d_k` is the last dimension of `k`, which is `2`). This is a small technical detail that helps stabilize the training process, especially in large models.
229 | ```python
230 | d_k = k.size(-1) # d_k = 2
231 | scaled_scores = scores / math.sqrt(d_k)
232 | ```
233 |
234 | **Step 4: Normalize (`softmax`)**
235 | We apply the `softmax` function along each row. This converts the raw scores into attention weights that sum to 1, representing the percentages from our intuition.
236 | ```python
237 | weights = F.softmax(scaled_scores, dim=-1) # Softmax along the rows
238 | print("\n--- Attention Weights ---")
239 | print(weights.data.round(decimals=2))
240 | ```
241 | **Output:**
242 | ```
243 | --- Attention Weights ---
244 | tensor([[[0.40, 0.36, 0.23],
245 |          [0.36, 0.36, 0.28],
246 |          [0.27, 0.33, 0.40]]])
247 | ```
248 | Each row now sums to 1. For example, Token 2 (row 1) will construct its new self by listening 36% to Token 1, 36% to itself, and 28% to Token 3.
249 |
250 | **Step 5: Aggregate Values (`weights @ V`)**
251 | Finally, we use our weights to create a weighted average of the **Value** vectors.
252 | * `weights` has shape `(1, 3, 3)`.
253 | * `v` has shape `(1, 3, 2)`.
254 | * The multiplication `(1, 3, 3) @ (1, 3, 2)` produces a final tensor of shape `(1, 3, 2)`.
255 |
256 | ```python
257 | output = weights @ v
258 | print("\n--- Final Output (Context-Aware Vectors) ---")
259 | print(output.shape)
260 | print(output.data.round(decimals=2))
261 | ```
262 | **Output:**
263 | ```
264 | --- Final Output (Context-Aware Vectors) ---
265 | torch.Size([1, 3, 2])
266 | tensor([[[0.72, 0.47],
267 |          [0.68, 0.50],
268 |          [0.57, 0.58]]])
269 | ```
270 | Success! We have taken our raw input `x` and produced a new tensor `output` of the exact same shape, where each token's vector has been updated with information from its neighbors.
271 |
272 | Here is a summary of the tensor transformations:
273 |
274 | | Step | Operation | Input Shapes | Output Shape `(B, T, ...)` |
275 | | :--- | :--- | :--- | :--- |
276 | | 1 | `Q, K, V = proj(x)` | `(1, 3, 2)` | `(1, 3, 2)` |
277 | | 2 | `Q @ K.T` | `(1, 3, 2)` & `(1, 2, 3)` | `(1, 3, 3)` |
278 | | 3 | `/ sqrt(d_k)` | `(1, 3, 3)` | `(1, 3, 3)` |
279 | | 4 | `softmax` | `(1, 3, 3)` | `(1, 3, 3)` |
280 | | 5 | `weights @ V`| `(1, 3, 3)` & `(1, 3, 2)` | `(1, 3, 2)` |
281 |
282 | We have now built the core engine. In the next chapter, we'll add two crucial upgrades to make it practical for real-world models.
283 |
284 | ## **Chapter 4: The Upgrades - Making Attention Practical**
285 |
286 | We have built the core attention engine. However, to use it in a real model like GPT, we need two crucial upgrades.
287 | 1. **Causality:** We must prevent the model from looking into the future when generating text.
288 | 2. **Parallelism:** We need to make the "conversation" richer by allowing it to happen from multiple perspectives at once.
289 |
290 | #### **Part 1: The Causal Mask ("Don't Look Ahead")**
291 |
292 | **The Problem:** GPT is an **autoregressive** model. When predicting the next word in the sentence "A cat sat...", its decision must be based *only* on the tokens it has seen so far: "A" and "cat". It cannot be allowed to see the answer, "sat".
293 |
294 | Our current attention matrix allows this cheating. The token "A" (at position 0) is gathering information from "cat" (position 1) AND "sat" (position 2). This is a problem.
295 |
296 | **The Solution:** The Causal Mask. We will modify the attention **score matrix** *before* applying the softmax function. We will "mask out" all future positions by setting their scores to negative infinity (`-inf`).
297 |
298 | Why `-inf`? Because the `softmax` function involves an exponential: `e^x`. The exponential of negative infinity, `e^-inf`, is effectively zero. This forces the attention weights for all future tokens to become `0`, preventing any information flow.
299 |
300 | **The Mechanism:**
301 | 1. **Create a Mask:** We use `torch.tril` to create a lower-triangular matrix. The `0`s in the upper-right triangle represent the "future" connections we must block.
302 | ```python
303 | T = 3
304 | mask = torch.tril(torch.ones(T, T))
305 | print("--- The Mask ---")
306 | print(mask)
307 | # tensor([[1., 0., 0.],
308 | # [1., 1., 0.],
309 | # [1., 1., 1.]])
310 | ```
311 | 2. **Apply the Mask:** We use `masked_fill` to apply our mask to the `scaled_scores` from the last chapter.
312 | ```python
313 | # Before masking
314 | # scaled_scores = tensor([[[0.74, 0.64, 0.20], ... ]])
315 |
316 | masked_scores = scaled_scores.masked_fill(mask == 0, float('-inf'))
317 |
318 | print("\n--- Scores After Masking ---")
319 | print(masked_scores.data.round(decimals=2))
320 | # tensor([[[ 0.74, -inf, -inf],
321 | # [ 0.64, 0.63, -inf],
322 | # [ 0.20, 0.37, 0.58]]])
323 | ```
324 | 3. **Re-run Softmax:** Applying softmax to these masked scores gives us causal attention weights.
325 | ```python
326 | causal_weights = F.softmax(masked_scores, dim=-1)
327 | print("\n--- Final Causal Attention Weights ---")
328 | print(causal_weights.data.round(decimals=2))
329 | # tensor([[[1.00, 0.00, 0.00],
330 | # [0.50, 0.50, 0.00],
331 | #          [0.27, 0.33, 0.40]]])
332 | ```
333 | The upper-right triangle of our attention matrix is now all zeros. "A" can only attend to itself. "cat" can only attend to "A" and itself. Information now only flows from the past to the present.
334 |
335 | ---
336 | #### **Part 2: Multi-Head Attention ("Many Conversations at Once")**
337 |
338 | **The Problem:** Our current attention mechanism is like having one person in a meeting who is responsible for figuring out all the relationships between words (syntax, semantics, etc.). This is a lot of pressure.
339 |
340 | **The Solution:** Multi-Head Attention. We split our embedding dimension `C` into several smaller chunks, called "heads". Each head will be its own independent attention mechanism, conducting its own "conversation" in parallel.
341 |
342 | * **Head 1** might learn to focus on verb-object relationships.
343 | * **Head 2** might learn to focus on pronoun references.
344 | * ...and so on.
345 |
346 | **The Mechanism:**
347 | Let's use a realistic `C = 768` and `n_head = 12`. The dimension of each head will be `head_dim = C / n_head = 64`.
348 |
349 | 1. **Split:** We take our Q, K, and V tensors (each shape `B, T, C`) and reshape them to `(B, n_head, T, head_dim)`. This makes the "heads" an explicit dimension.
350 | ```python
351 | # B=1, T=3, C=768
352 | q = torch.randn(1, 3, 768)
353 |
354 | # Split C into (n_head, head_dim) -> (12, 64)
355 | q_multi_head = q.view(1, 3, 12, 64)
356 |
357 | # Bring the head dimension forward for parallel computation
358 | q_multi_head = q_multi_head.transpose(1, 2) # -> (1, 12, 3, 64)
359 | ```
360 | 2. **Attend in Parallel:** We perform the exact same scaled dot-product attention as before. PyTorch's broadcasting automatically handles the `n_head` dimension, performing 12 attention calculations at once. The output, which we'll call `output_per_head`, has shape `(B, n_head, T, head_dim)` (see the sketch after this list).
361 | 3. **Merge:** We reverse the split operation. We concatenate the heads back together into a single `C`-dimensional vector.
362 | ```python
363 | # Transpose back and reshape
364 | merged_output = output_per_head.transpose(1, 2).contiguous().view(1, 3, 768)
365 | ```
366 | 4. **Project:** We pass this merged output through a final linear layer (`c_proj`). This allows the model to learn how to best combine the insights from all the different heads.
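
To tie these steps together, here is a minimal, self-contained sketch of the split → attend → merge pipeline (the final `c_proj` projection from step 4 is omitted). The random tensors are stand-ins for the learned Q, K, V projections, so only the shapes are meaningful:

```python
import math
import torch
import torch.nn.functional as F

B, T, C, n_head = 1, 3, 768, 12
head_dim = C // n_head  # 64

# Stand-ins for the projected Q, K, V tensors, each of shape (B, T, C)
q, k, v = torch.randn(B, T, C), torch.randn(B, T, C), torch.randn(B, T, C)

# 1. Split: (B, T, C) -> (B, n_head, T, head_dim)
q = q.view(B, T, n_head, head_dim).transpose(1, 2)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)

# 2. Attend in parallel: broadcasting runs all 12 heads at once
att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)   # (B, n_head, T, T)
att = F.softmax(att, dim=-1)
output_per_head = att @ v                                # (B, n_head, T, head_dim)

# 3. Merge: transpose back and reshape to (B, T, C)
merged_output = output_per_head.transpose(1, 2).contiguous().view(B, T, C)
print(merged_output.shape)  # torch.Size([1, 3, 768])
```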
367 |
368 | By having multiple parallel conversations, the model can analyze the input text from many different perspectives at the same time, making it far more powerful.
369 |
370 | We now have all the conceptual pieces. In the final chapter, we will assemble them into our complete, production-ready code.
371 |
372 | ## **Chapter 5: The Final Blueprint & Conclusion**
373 |
374 | We have built all the pieces: the core engine, the causal mask, and the multi-head architecture. Now, let's look at our final blueprint one last time to see how these concepts snap together into a single, elegant piece of code.
375 |
376 | The intimidating module from the introduction should now look like a familiar map.
377 |
378 | ```python
379 | class CausalSelfAttention(nn.Module):
380 | def __init__(self, config):
381 | super().__init__()
382 | assert config.n_embd % config.n_head == 0
383 | # The layers for QKV projection, multi-head output, and the mask
384 | self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
385 | self.c_proj = nn.Linear(config.n_embd, config.n_embd)
386 | self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size)).view(1, 1, config.block_size, config.block_size))
387 | # ...
388 |
389 | def forward(self, x):
390 | B, T, C = x.size()
391 |
392 | # 1. Get Q, K, V from a single efficient projection
393 | q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
394 |
395 | # 2. Split into multiple heads for parallel conversations
396 | q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
397 | k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
398 | v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
399 |
400 | # 3. The core engine: scaled, masked, dot-product attention
401 | att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
402 | att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf")) # No looking ahead!
403 | att = F.softmax(att, dim=-1)
404 | y = att @ v
405 |
406 | # 4. Merge the heads back together and finalize
407 | y = y.transpose(1, 2).contiguous().view(B, T, C)
408 | return self.c_proj(y)
409 | ```
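
If you want to sanity-check the shapes end to end, here is a small usage sketch. The `Config` dataclass is a hypothetical stand-in, and we assume the elided `# ...` in `__init__` also stores `config.n_head` and `config.n_embd` as attributes (since `forward` reads both) and that the usual `torch`, `nn`, `F`, and `math` imports are in scope for the class above:

```python
from dataclasses import dataclass
import torch

@dataclass
class Config:            # hypothetical minimal config for this sketch
    n_embd: int = 768
    n_head: int = 12
    block_size: int = 128

config = Config()
attn = CausalSelfAttention(config)
# Stand-in for the elided "# ..." assignments that forward() relies on:
attn.n_embd, attn.n_head = config.n_embd, config.n_head

x = torch.randn(2, 10, config.n_embd)   # (B, T, C)
y = attn(x)
print(y.shape)                           # torch.Size([2, 10, 768]) - same shape, now context-aware
```
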
410 | Each part of this code now has a clear purpose:
411 | * **Problem:** We take a static input `x` and produce a context-aware output `y`.
412 | * **Intuition:** The Q, K, V "conversation" is implemented here.
413 | * **Engine:** The core `q @ k.transpose...` logic is the mathematical heart.
414 | * **Upgrades:** The `masked_fill` provides causality, and the `view/transpose` operations create the parallel heads.
415 |
416 | In the last 20 minutes, we have gone on a journey. We started with a fundamental problem: words are static, but meaning depends on context. We solved it by building a mechanism that allows words to have a "conversation."
417 |
418 | We built the intuition for this conversation with Queries, Keys, and Values. We translated that intuition into the efficient mathematics of dot-product attention. We then upgraded it with a causal mask and multi-head parallelism to make it powerful and practical.
419 |
420 | This `CausalSelfAttention` module is the single most important component of modern large language models. It is the engine that drives understanding in every "Transformer Block," which are then stacked dozens of times to create models like GPT.
421 |
422 | The magic is gone, replaced by elegant, understandable engineering. **Attention has clicked.**
--------------------------------------------------------------------------------
/docs/math/eigen.md:
--------------------------------------------------------------------------------
1 | # A Practical Guide to Eigenvalues & Eigenvectors
2 |
3 | ## Introduction: Your 40-Minute Promise
4 |
5 | Give me 40 minutes, and I will make one of linear algebra's most important concepts click. We will move beyond treating matrices as simple grids of numbers and uncover the deep information they hold about the systems they represent.
6 |
7 | This tutorial is direct. We will start with a visual intuition, formalize it into a step-by-step algorithm, and apply it to a practical problem.
8 |
9 | By the end, you will understand the following:
10 |
11 | | The Concept & Its Formalism | What You Will Understand |
12 | | :-------------------------------------------------------------- | :------------------------------------------------------------------------------------------------ |
13 | | **The Core Idea** <br> `Av = λv` | What Eigenvectors and Eigenvalues are and why they represent the "axes of a transformation." |
14 | | **The Algorithm** <br> `det(A - λI) = 0` | The step-by-step recipe to find the eigenvalues (λ) and eigenvectors (v) for any square matrix. |
15 | | **The Application** <br> `v_k = A^k v₀` | How eigenvalues predict the long-term behavior and stability of a system, from population growth to physics. |
16 |
17 | This is the machinery that powers everything from Google's PageRank algorithm to facial recognition. Let's begin.
18 |
19 | ### Quick Recap: A Matrix is an Action
20 |
21 | Before we hunt for special vectors, we must recall a core concept: **a matrix is an action.**
22 |
23 | When we multiply a vector `v` by a matrix `A`, we are applying a **linear transformation**. The matrix `A` acts on the input vector `v` to produce a new output vector `v'`.
24 |
25 | The equation is simply:
26 | `v' = Av`
27 |
28 | This action can be a rotation, a scaling, a shear (a slant), or a combination of these.
29 |
30 | #### A Concrete Example: A Shear Transformation
31 |
32 | Consider this matrix `A`, which represents a shear transformation. It pushes things horizontally.
33 | `A = [[1, 1], [0, 1]]`
34 |
35 | Let's see what this action does to a sample vector, `v = [2, 3]`.
36 |
37 | **Input:**
38 | ```python
39 | import numpy as np
40 |
41 | A = np.array([[1, 1], [0, 1]])
42 | v = np.array([2, 3])
43 | ```
44 |
45 | **Calculation:**
46 | `v_prime = A @ v`
47 |
48 | This is the matrix-vector multiplication:
49 | `[[1, 1], [0, 1]] * [[2], [3]] = [[1*2 + 1*3], [0*2 + 1*3]] = [[5], [3]]`
50 |
51 | **Output:**
52 | ```python
53 | print(f"Original vector v: {v}")
54 | print(f"Transformed vector v': {v_prime}")
55 | # Output:
56 | # Original vector v: [2 3]
57 | # Transformed vector v': [5 3]
58 | ```
59 |
60 | ```
61 | A 2D coordinate plane.
62 | Vector 'v' is an arrow from the origin (0,0) to the point (2,3).
63 | Vector 'v_prime' is another arrow from the origin (0,0) to the point (5,3).
64 | The tip of vector 'v' has been pushed horizontally to the right. The two vectors do not lie on the same line.
65 | ```
66 |
67 | The shear matrix `A` took our vector `v` and knocked it off its original line. The direction changed.
68 |
69 | This leads to the central question of this tutorial:
70 |
71 | **Are there any special, non-zero vectors that a matrix *doesn't* knock off their line? Are there vectors whose direction remains unchanged by the transformation?**
72 |
73 | The answer is yes. These special vectors are the **eigenvectors** of the matrix. Let's find them.
74 |
75 | ## Chapter 1: The Core Idea - "Axes of a Transformation"
76 |
77 | **The Big Picture:** We are on a hunt. We are looking for the special directions in space that a matrix transformation does not change. These directions are the "axes" of the transformation, and they reveal its most fundamental properties.
78 |
79 | #### The Motivating Problem: Finding an Unchanged Direction
80 |
81 | Let's continue with our shear matrix `A = [[1, 1], [0, 1]]`. We saw it knocked the vector `[2, 3]` off its line. What about other vectors?
82 |
83 | **Case 1: A vertical vector**
84 | Let's try a vector pointing straight up, `v_up = [0, 1]`.
85 | `A * v_up = [[1, 1], [0, 1]] * [[0], [1]] = [[1*0 + 1*1], [0*0 + 1*1]] = [[1], [1]]`
86 | The vector `[0, 1]` was transformed into `[1, 1]`. Its direction changed. It was knocked off its original line (the y-axis).
87 |
88 | **Case 2: A horizontal vector**
89 | Now let's try a vector pointing straight right, `v_right = [1, 0]`.
90 | `A * v_right = [[1, 1], [0, 1]] * [[1], [0]] = [[1*1 + 1*0], [0*1 + 1*0]] = [[1], [0]]`
91 | This is different. The vector `[1, 0]` was transformed into... `[1, 0]`. Its direction is **perfectly unchanged**.
92 |
93 | We found one. The horizontal direction is a special, stable direction for this shear transformation.
94 |
95 | ```
96 | A 2D coordinate plane showing a shear transformation.
97 | A vertical blue arrow at (0,1) is transformed into a slanted blue arrow at (1,1). The direction clearly changes.
98 | A horizontal red arrow at (1,0) is transformed and lands right back on top of itself at (1,0). The direction is unchanged.
99 | The red arrow represents an eigenvector.
100 | ```
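
A quick NumPy check of both cases, reusing the shear matrix from the recap (nothing new here, just the same arithmetic):

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])   # the shear matrix

v_up = np.array([0, 1])
v_right = np.array([1, 0])

print(A @ v_up)     # [1 1] -> knocked off its line (direction changed)
print(A @ v_right)  # [1 0] -> unchanged direction: an eigenvector
```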
101 |
102 | #### Intuition 1: Eigenvectors are Directions that Don't Turn
103 |
104 | An **eigenvector** of a matrix `A` is any non-zero vector `v` that, when acted upon by `A`, does not change its direction. It stays on the same line through the origin.
105 |
106 | For our shear matrix `A`, the vector `v = [1, 0]` is an eigenvector. Any vector on that same line, like `[3, 0]` or `[-2, 0]`, is also an eigenvector. The entire x-axis is a special direction for this transformation. This special line or set of vectors is called the **eigenspace**.
107 |
108 | #### Intuition 2: Eigenvalues are the Stretch Factors
109 |
110 | When we found our eigenvector `v = [1, 0]`, the result was `Av = [1, 0]`.
111 | How much was `v` stretched? It was multiplied by `1`.
112 | `Av = 1 * v`
113 |
114 | This stretch/squish factor is the **eigenvalue**, denoted by the Greek letter lambda (`λ`). Each eigenvector has a corresponding eigenvalue that tells you *how much* it was scaled along its special direction.
115 |
116 | * If `|λ| > 1`, the eigenvector is stretched.
117 | * If `|λ| < 1`, the eigenvector is compressed.
118 | * If `λ < 0`, the eigenvector is flipped and points in the opposite direction (but still on the same line).
119 |
120 | #### Formalization: The Eigen-Equation
121 |
122 | This entire relationship is captured in one elegant equation:
123 |
124 | `Av = λv`
125 |
126 | Let's translate this into plain English:
127 |
128 | > "The action of the matrix `A` on its special eigenvector `v`...
129 | > ...is identical to simply scaling that vector `v` by a number `λ`."
130 |
131 | This is the central equation of the topic. Finding the eigenvalues and eigenvectors of a matrix means finding all the special pairs `(λ, v)` that make this equation true. Our visual hunt was successful for one pair:
132 |
133 | * For `A = [[1, 1], [0, 1]]`, one solution is `λ = 1` and `v = [1, 0]`.
134 |
135 | Visual inspection is not a reliable method. We need a systematic algorithm to solve for `λ` and `v` for any matrix. That is our next step.
136 |
137 | ## Chapter 2: The Algorithm - The Recipe for Finding Eigen-Properties
138 |
139 | **The Big Picture:** We found an eigenvector through visual inspection, but that's not a reliable method. We need a systematic recipe to solve the core equation `Av = λv` for any matrix `A`. This chapter builds that recipe.
140 |
141 | Our goal is to solve `Av = λv`, but we have two unknowns: the scalar eigenvalue `λ` and the vector `v`. We need to untangle them.
142 |
143 | #### The Algebraic Rearrangement
144 |
145 | Let's get all the terms involving the vector `v` onto one side of the equation.
146 |
147 | 1. Start with the eigen-equation:
148 | `Av = λv`
149 |
150 | 2. Subtract `λv` from both sides:
151 | `Av - λv = 0`
152 |
153 | 3. Factor out the vector `v`:
154 | `(A - λ)v = 0`
155 |
156 | Wait. This has a mathematical problem. `A` is a matrix, but `λ` is a scalar. You cannot subtract a number from a grid of numbers. We need to make `λ` "matrix-compatible."
157 |
158 | The solution is to use the **Identity Matrix**, `I`. Recall that `Iv = v` for any vector `v`. Multiplying by `I` is the matrix equivalent of multiplying by `1`.
159 |
160 | Let's rewrite `λv` as `λIv`. Now our equation becomes:
161 | `Av - λIv = 0`
162 |
163 | Now we can legally factor out `v`:
164 | `(A - λI)v = 0`
165 |
166 | This is the standard form we will work with.
167 |
168 | #### The Key Insight: The Determinant
169 |
170 | Let's look at our rearranged equation: `(A - λI)v = 0`.
171 |
172 | Let's call the new matrix `M = (A - λI)`. The equation is now `Mv = 0`.
173 |
174 | This equation asks: "Which vector `v` does the matrix `M` transform into the zero vector?"
175 |
176 | There's always one trivial, useless answer: `v = [0, 0, ...]`. The zero vector. But eigenvectors, by definition, must be **non-zero**.
177 |
178 | So, we are looking for a non-trivial solution. When does `Mv = 0` have a solution where `v` is not zero? This only happens if the matrix `M` is **singular**. A singular matrix is one that collapses space into a lower dimension (e.g., squishes a 2D plane onto a single line). When this happens, a whole line of vectors gets mapped to the origin, giving us our non-zero solutions.
179 |
180 | And what is the definitive test for a singular matrix? Its determinant is zero.
181 |
182 | Therefore, the only way for `(A - λI)v = 0` to have a non-zero solution `v` is if:
183 |
184 | `det(A - λI) = 0`
185 |
186 | This is the key. We have eliminated the vector `v` for a moment and created an equation that contains only `λ`. This is called the **characteristic equation**.
187 |
188 | #### The Two-Step Algorithm
189 |
190 | We now have a reliable, two-step recipe for finding all eigenvalues and eigenvectors of a matrix `A`.
191 |
192 | | Step | Action |
193 | | :----------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------- |
194 | | **1. Find Eigenvalues** | Set up and solve the **characteristic equation**: `det(A - λI) = 0`. This will result in a polynomial in `λ`. The roots of this polynomial are the eigenvalues of `A`. |
195 | | **2. Find Eigenvectors** | For **each** eigenvalue `λ` you just found, plug it back into the equation `(A - λI)v = 0`. Solve this system of linear equations to find the vector `v`. This vector (and any multiple of it) is the corresponding eigenvector. |
196 |
197 | This process will give us every `(λ, v)` pair that satisfies the original `Av = λv` equation. In the next chapter, we will execute this recipe with a concrete example.
198 |
199 | ## Chapter 2 (continued): The Algorithm in Action - A Step-by-Step Example
200 |
201 | **The Big Picture:** We need a reliable recipe to solve `Av = λv`. In this chapter, we will learn this recipe by applying it directly to a concrete matrix, step-by-step. The process has two main goals: first find the eigenvalues (`λ`), then find the corresponding eigenvectors (`v`).
202 |
203 | **Our Example Matrix:**
204 | Let's find the eigenvalues and eigenvectors for the following matrix `A`.
205 | `A = [[3, 1], [1, 3]]`
206 | Notice that this matrix is **symmetric** (it's equal to its transpose). This is a special property, and we should watch for its effects on our final answer.
207 |
208 | ---
209 |
210 | ### Step 1: Find the Eigenvalues (λ)
211 |
212 | **The Goal:** Solve the **characteristic equation**, `det(A - λI) = 0`. This will give us the eigenvalues.
213 |
214 | **Action A: Construct the matrix `A - λI`.**
215 | This is our starting matrix `A` with `λ` subtracted from its main diagonal. The Identity Matrix `I` makes the math work.
216 |
217 | `A - λI = [[3, 1], [1, 3]] - λ * [[1, 0], [0, 1]]`
218 | `= [[3, 1], [1, 3]] - [[λ, 0], [0, λ]]`
219 | `= [[3-λ, 1 ], [ 1 , 3-λ]]`
220 |
221 | **Action B: Calculate its determinant and set it to zero.**
222 | For a 2x2 matrix `[[a, b], [c, d]]`, the determinant is `ad - bc`.
223 |
224 | `det([[3-λ, 1], [1, 3-λ]]) = (3-λ)(3-λ) - (1)(1) = 0`
225 |
226 | **Action C: Solve the polynomial for λ.**
227 | Now we just do the algebra.
228 |
229 | `(3-λ)(3-λ) - 1 = 0`
230 | `9 - 3λ - 3λ + λ² - 1 = 0`
231 | `λ² - 6λ + 8 = 0`
232 |
233 | This is a simple quadratic equation. It factors nicely:
234 | `(λ - 4)(λ - 2) = 0`
235 |
236 | The solutions are our eigenvalues:
237 | `λ₁ = 4`
238 | `λ₂ = 2`
239 |
240 | **Result of Step 1:** We've done it. The two special "stretch factors" for our matrix `A` are 4 and 2. This means our transformation has two special axes (eigenspaces). Along one axis, vectors are scaled by 4. Along the other, they are scaled by 2.
241 |
242 | ---
243 |
244 | ### Step 2: Find the Eigenvectors (v)
245 |
246 | **The Goal:** For each eigenvalue we just found, we plug it back into the equation `(A - λI)v = 0` and solve for the vector `v`.
247 |
248 | #### Case 1: For Eigenvalue `λ₁ = 4`
249 |
250 | **Action A: Plug `λ = 4` into `(A - λI)`.**
251 | `(A - 4I)v = 0`
252 | `[[3-4, 1 ], [ 1 , 3-4]] * [[x], [y]] = [[0], [0]]`
253 | `[[-1, 1], [ 1, -1]] * [[x], [y]] = [[0], [0]]`
254 |
255 | **Action B: Solve the system of linear equations.**
256 | This matrix equation represents two linear equations:
257 | 1. `-x + y = 0` (which means `y = x`)
258 | 2. `x - y = 0` (which also means `y = x`)
259 |
260 | Notice that both equations are identical. This is not a mistake; it is a guarantee. Because we chose λ precisely to make `(A - λI)` singular, the system must have infinitely many solutions, and they form the line (the eigenspace) we are looking for.
261 |
262 | We need to find a non-zero vector `v = [x, y]` where `y = x`. The simplest choice is `x=1`, which makes `y=1`.
263 |
264 | The eigenvector `v₁` corresponding to `λ₁ = 4` is `v₁ = [1, 1]`.
265 |
266 | #### Case 2: For Eigenvalue `λ₂ = 2`
267 |
268 | **Action A: Plug `λ = 2` into `(A - λI)`.**
269 | `(A - 2I)v = 0`
270 | `[[3-2, 1 ], [ 1 , 3-2]] * [[x], [y]] = [[0], [0]]`
271 | `[[1, 1], [1, 1]] * [[x], [y]] = [[0], [0]]`
272 |
273 | **Action B: Solve the system.**
274 | This gives us the equation:
275 | `x + y = 0` (which means `y = -x`)
276 |
277 | We need a vector where the y-component is the negative of the x-component. The simplest choice is `x=1`, which makes `y=-1`.
278 |
279 | The eigenvector `v₂` corresponding to `λ₂ = 2` is `v₂ = [1, -1]`.
280 |
281 | ---
282 |
283 | ### Final Result & Verification
284 |
285 | We have successfully executed the algorithm. The eigen-pairs are:
286 |
287 | * **Eigenvalue `λ₁ = 4` with Eigenvector `v₁ = [1, 1]`**
288 | * **Eigenvalue `λ₂ = 2` with Eigenvector `v₂ = [1, -1]`**
289 |
290 | Let's quickly verify the first pair to be sure. Does `Av₁` equal `λ₁v₁`?
291 |
292 | * `Av₁ = [[3, 1], [1, 3]] * [[1], [1]] = [[3*1 + 1*1], [1*1 + 3*1]] = [[4], [4]]`
293 | * `λ₁v₁ = 4 * [[1], [1]] = [[4], [4]]`
294 |
295 | They match. The algorithm works.
296 |
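We can also let NumPy run the entire recipe for us. `np.linalg.eig` returns the eigenvalues and unit-length eigenvectors (as columns), which are just rescaled versions of the vectors we found by hand:

```python
import numpy as np

A = np.array([[3, 1], [1, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)

print(eigenvalues)    # 4. and 2. (order may vary)
print(eigenvectors)   # columns proportional to [1, 1] and [1, -1] (up to sign),
                      # e.g. [0.7071, 0.7071] and [-0.7071, 0.7071]
```
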
297 | #### Spotlight: The Power of Symmetric Matrices
298 |
299 | Remember we noted that our starting matrix `A` was symmetric? This was not a coincidence. It gave us two special properties:
300 | 1. **The eigenvalues (4 and 2) are real numbers.** This is always true for symmetric matrices.
301 | 2. **The eigenvectors ([1, 1] and [1, -1]) are orthogonal.** We can check this with the dot product: `(1)(1) + (1)(-1) = 1 - 1 = 0`. They are perfectly perpendicular. This is also always true for the eigenvectors of a symmetric matrix.
302 |
303 | This "clean geometry" of real eigenvalues and orthogonal eigenvectors is why symmetric matrices are so fundamental in physics and data science. They represent well-behaved systems.
304 |
305 | ## Chapter 3: Why We Care - Predicting a System's Future
306 |
307 | **The Big Picture:** Eigenvalues and eigenvectors are powerful because they allow us to predict the long-term behavior of a system that evolves in steps. Crucially, they work for *any* starting condition, not just special cases. They reveal the ultimate destiny of the system.
308 |
309 | #### The Motivating Problem: A Population Model
310 |
311 | Consider a simple population model. A state is a vector `v = [[city_dwellers], [suburb_dwellers]]`, and a transition matrix `A` describes the yearly change.
312 |
313 | `A = [[0.95, 0.10], [0.05, 0.90]]`
314 |
315 | The population after `k` years is `v_k = A^k * v₀`. Our goal is to predict the state after 50 years for an arbitrary starting population, say `v₀ = [[300], [1700]]`. Brute-forcing `A^50` is inefficient and gives little insight.
316 |
317 | #### The General Solution: Decomposing the Problem
318 |
319 | If `v₀` happened to be exactly an eigenvector, prediction would be easy: `A^k v₀` would simply be `λ^k v₀`. But the real power comes from this fact: for most matrices, the eigenvectors form a **basis**. This means we can write *any* starting vector `v₀` as a weighted sum of the eigenvectors.
320 |
321 | Running the two-step algorithm from the last chapter on this matrix gives two eigen-pairs:
322 | * `λ₁ = 1.0` with eigenvector `v₁ ≈ [0.894, 0.447]` (the stable state)
323 | * `λ₂ = 0.85` with eigenvector `v₂ ≈ [0.707, -0.707]` (a decaying state)
324 |
325 | So, we can express our starting vector `v₀ = [[300], [1700]]` as:
326 | `v₀ = c₁v₁ + c₂v₂`
327 |
328 | Here, `c₁` and `c₂` are just weights that tell us "how much" of each eigenvector is in our initial mix. (Finding the exact values of `c₁` and `c₂` is a standard linear algebra problem, but for now, just know that they exist).
329 |
330 | Now, let's see what happens when we apply the transformation `A^50` to this combination.
331 |
332 | 1. Start with our decomposed vector:
333 | `v₅₀ = A^50 * (c₁v₁ + c₂v₂)`
334 |
335 | 2. Because matrix multiplication is linear, we can distribute the `A^50`:
336 | `v₅₀ = c₁ * (A^50 v₁) + c₂ * (A^50 v₂)`
337 |
338 | 3. Now we use the defining property of eigenvectors on each part: applying `A` to an eigenvector only scales it, so applying it 50 times gives `A^50 v₁ = (λ₁^50) v₁` and `A^50 v₂ = (λ₂^50) v₂`. Substitute that in:
339 | `v₅₀ = c₁ * (λ₁^50) * v₁ + c₂ * (λ₂^50) * v₂`
340 |
341 | This is the general, powerful formula for predicting the future state.
342 |
343 | #### The "Aha!" Moment: The Dominant Eigenvalue
344 |
345 | Let's plug in our actual eigenvalues: `λ₁ = 1.0` and `λ₂ = 0.85`.
346 |
347 | `v₅₀ = c₁ * (1.0^50) * v₁ + c₂ * (0.85^50) * v₂`
348 |
349 | Now, let's look at what these numbers become:
350 | * `1.0^50 = 1`
351 | * `0.85^50 ≈ 0.000296` (this is practically zero!)
352 |
353 | The equation becomes:
354 | `v₅₀ ≈ c₁ * (1) * v₁ + c₂ * (a tiny number) * v₂`
355 | `v₅₀ ≈ c₁v₁`
356 |
357 | **This is the profound insight.**
358 |
359 | After 50 years, the part of our starting vector corresponding to the smaller eigenvalue has decayed away into almost nothing. The final state of the system is almost perfectly aligned with the **dominant eigenvector**, `v₁`.
360 |
361 | The dominant eigenvector isn't just a special case; it is the **destiny** of the system. Regardless of the initial mix `(c₁, c₂)`, as long as there is *some* of the dominant eigenvector in the mix (`c₁ ≠ 0`), the system will converge to that stable state.
362 |
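A short NumPy sketch makes this destiny visible. We solve for the weights `c₁` and `c₂`, then compare the brute-force `A^50 v₀` with the dominant term `c₁v₁` alone (using the eigen-pairs listed above):

```python
import numpy as np

A = np.array([[0.95, 0.10], [0.05, 0.90]])
v0 = np.array([300.0, 1700.0])

# Eigen-pairs found earlier
lam1, v1 = 1.00, np.array([0.894, 0.447])
lam2, v2 = 0.85, np.array([0.707, -0.707])

# Solve v0 = c1*v1 + c2*v2 for the mixing weights
c1, c2 = np.linalg.solve(np.column_stack([v1, v2]), v0)

v50_brute_force = np.linalg.matrix_power(A, 50) @ v0
v50_dominant    = c1 * (lam1 ** 50) * v1     # the c2 term has decayed to ~0

print(v50_brute_force)   # ≈ [1333, 667]
print(v50_dominant)      # ≈ [1333, 667] -- the dominant eigenvector is the destiny
```
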
363 | ## Conclusion: The Next Step
364 |
365 | You have now worked through one of the most fundamental concepts in linear algebra. You started with the simple idea of a matrix as an action and asked a powerful question: "Are there any directions that don't change?"
366 |
367 | This led you to the core ideas:
368 | * **Eigenvectors** are the "axes" of a linear transformation—the special directions that remain unchanged.
369 | * **Eigenvalues** are the scaling factors along these axes.
370 | * The relationship `Av = λv` is the key that unlocks the long-term behavior of a system, allowing you to predict its final state by finding its dominant trend.
371 |
372 | This foundation is the gateway to some of the most powerful techniques in data science and engineering.
373 |
374 | #### Teaser 1: Principal Component Analysis (PCA)
375 |
376 | We used eigenvectors to understand a transformation matrix. But what if you just have a huge cloud of data points, like a spreadsheet of customer measurements? PCA is a technique to find the most important patterns in that data.
377 |
378 | It works by first computing a special matrix from the data called the **covariance matrix**. This matrix has a crucial property: it is always **symmetric**. As we saw in our example, this means its eigenvectors are orthogonal (perpendicular).
379 |
380 | * The **dominant eigenvector** of the covariance matrix points in the direction where the data is most spread out—the direction of maximum variance. This is the "most important pattern" or **Principal Component 1**.
381 | * The next eigenvector points in the next most important direction, and so on.
382 |
383 | By using these eigenvectors, you can rotate your data to a new, more insightful perspective or reduce its dimensions by keeping only the most important components. This is a cornerstone of modern data analysis.
384 |
385 | ```
386 | A 2D scatter plot of data points shaped like a tilted ellipse.
387 | The data is spread out most along a diagonal line from bottom-left to top-right.
388 | A long red arrow is drawn along this line, labeled "Principal Component 1 (the dominant eigenvector)".
389 | A shorter blue arrow is drawn perpendicular to the first, along the ellipse's minor axis, labeled "Principal Component 2".
390 | ```
391 |
392 | #### Teaser 2: Singular Value Decomposition (SVD)
393 |
394 | Eigen-analysis is powerful, but it has one major limitation: it only works for **square matrices**. What about rectangular matrices, which are common in data science (e.g., a matrix of `users × movie_ratings`)?
395 |
396 | The **Singular Value Decomposition (SVD)** is the more general and powerful big brother of eigendecomposition. It breaks down *any* matrix `A`—square or rectangular—into three simpler components representing rotation, scaling, and another rotation.
397 |
398 | `A = UΣV^T`
399 |
400 | The "singular values" in the diagonal matrix `Σ` are like eigenvalues; they tell you the "magnitudes" of the transformation. SVD is a master algorithm used in:
401 | * **Image Compression:** By keeping only the largest singular values, you can reconstruct a good approximation of an image with much less data.
402 | * **Recommender Systems:** Used by companies like Netflix to find patterns in the user-item rating matrix and suggest movies you might like.
403 | * **Noise Reduction:** Separating the important "signal" (large singular values) from the "noise" (small singular values) in data.
404 |
405 | You now have the core intuition needed to tackle these advanced, practical, and fascinating topics.
406 |
407 |
408 | ### Conclusion: The Next Step
409 |
410 | You have journeyed from the core idea of an "unchanged direction" to a robust algorithm for finding the hidden properties of any matrix. You now understand that the equation `Av = λv` is not just an abstract exercise; it is a powerful tool for revealing the fundamental axes of a transformation and predicting the long-term behavior of a system.
411 |
412 | You have built the foundation needed to understand one of the most important techniques in modern data science.
413 |
414 | #### The Next Step: Principal Component Analysis (PCA)
415 |
416 | We have seen that eigenvectors are the "axes of transformation" for a matrix. Now, ask a new question: what are the "axes of a dataset"?
417 |
418 | Imagine a cloud of data points, like a scatter plot. PCA is the technique used to find the directions of the most variance—the directions in which the data is most spread out.
419 |
420 | Here is the connection:
421 | 1. From your data, you can compute a **covariance matrix**. This is a symmetric matrix that describes how different features in your data vary with each other.
422 | 2. The **eigenvectors** of this covariance matrix point in the exact directions of maximum variance in your data. These are called the **principal components**.
423 | 3. The **eigenvalue** corresponding to each eigenvector tells you *how much* of the data's total variance lies along that component.
424 |
425 | By finding these eigen-pairs, you can identify the most important patterns in your data. This allows you to perform dimensionality reduction: you can keep the few most important principal components and discard the rest, compressing your data significantly while losing very little information.
426 |
427 | What you have learned today is the engine that drives this powerful and widely-used technique. You are now ready to explore it.
--------------------------------------------------------------------------------
/docs/llm/pretrain.md:
--------------------------------------------------------------------------------
1 | # **Title: LLM Pre-Training in 30 Minutes**
2 |
3 | ### Introduction: The Magic is Just Math
4 |
5 | You've mastered how a neural network learns. You've seen the elegant dance of gradient descent and backpropagation, the mathematical engine that allows a machine to get progressively better at a task by minimizing its error. But that was with numbers. Clean, predictable, logical numbers.
6 |
7 | Now we tackle language.
8 |
9 | When a model like ChatGPT writes code, explains quantum physics, or composes a sonnet, it seems like magic. Here's the secret: it's the *exact same learning process*, applied relentlessly to one deceptively simple task: **predict the next token.**
10 |
11 | In this tutorial, we will deconstruct the GPT-2 pre-training pipeline step by step. You will learn the exact algorithms and mathematics that allow a machine to teach itself from the raw, unlabeled text of the internet. We will build the entire conceptual pipeline, showing you exactly how "The cat sat on the" becomes a prediction for "mat," and how doing this billions of times creates the emergent intelligence you see today.
12 |
13 | We will cover three key stages, anchored by this roadmap:
14 |
15 | ```mermaid
16 | graph LR
17 | subgraph "Part 1 & 2: Data Preparation"
18 | A[Raw Text] --> B{Training Data Algorithm}
19 | B --> C[Input Token IDs]
20 | end
21 | subgraph "Part 3: The Learning Loop"
22 | C --> D["Model (Embeddings + Transformer)"]
23 | D --> E[Output Logits]
24 | E --> F[Softmax -> Probabilities]
25 | F --> G["Cross-Entropy Loss (Error)"]
26 | G -.-> D
27 | end
28 | ```
29 |
30 | By the end, you will understand:
31 | 1. **Data Preparation:** How we turn the entire internet into an infinite source of free training examples.
32 | 2. **The Learning Loop:** How the model makes a prediction, measures its own error, and systematically corrects itself.
33 |
34 | Let's begin by tackling the single biggest bottleneck in the history of AI: data.
35 |
36 | ## **Part 1: The Data Revolution - Learning Without Labels**
37 |
38 | #### The Problem: The Expensive Reality of Supervised Learning
39 |
40 | You've seen neural networks master tasks through supervised learning. The recipe is simple: give the model an input (like an image) and a correct output (the label "cat"), and it learns to map one to the other.
41 |
42 | But there's a massive bottleneck: getting that labeled training data is painfully expensive and fundamentally limiting. Consider the real-world costs:
43 |
44 | * **Medical Imaging:** Radiologists, who charge hundreds of dollars per hour, are needed to label tumors in MRI scans.
45 | * **Legal Documents:** Lawyers, billing even more, are required to classify contracts or find evidence in discovery documents.
46 | * **Scientific Research:** PhD researchers can spend months or years meticulously annotating datasets for their experiments.
47 |
48 | This creates two fundamental problems:
49 |
50 | 1. **The Scale Ceiling:** The famous ImageNet dataset, with its 14 million labeled images, took years and millions of dollars to create. Yet, this is a tiny fraction of the billions of unlabeled images on the internet that we can't use.
51 | 2. **The "Garbage In, Garbage Out" Problem:** The quality of the model is capped by the quality of its labels. Getting high-quality annotations requires true experts, making the process even more expensive and less scalable.
52 |
53 | Supervised learning, for all its power, hits a wall. How do you get to billions or trillions of training examples if every single one requires an expensive human expert?
54 |
55 | #### The Breakthrough: The Self-Supervised Engine
56 |
57 | The genius of models like GPT-2 wasn't just a bigger architecture—it was abandoning the need for human labels entirely. Instead of asking a human, "What is the right answer?", it asks the text itself.
58 |
59 | The task is deceptively simple: **predict the next word.**
60 |
61 | That's it. No human annotation is needed. The text provides both the input (the sequence of words so far) and the "label" (the very next word in the sequence). This is the core of **self-supervised learning**.
62 |
63 | We promised you an algorithm, and here is the simple, powerful engine that turns any document into an almost unlimited supply of training data.
64 |
65 | ```
66 | // ALGORITHM: CreateTrainingData
67 |
68 | INPUT: A document of text, broken into a list of words/tokens T.
69 | T = [t_1, t_2, t_3, ..., t_n]
70 |
71 | OUTPUT: A set of (input, output) pairs for training.
72 |
73 | FOR k FROM 1 TO n-1:
74 | input_sequence = [t_1, ..., t_k]
75 | target_word = t_{k+1}
76 |
77 | ADD (input_sequence, target_word) TO output_set
78 |
79 | RETURN output_set
80 | ```
81 |
82 | #### Step-by-Step Example: Slicing a Sentence
83 |
84 | Let's see this algorithm in action. Take the simple sentence: **"The cat sat on the mat."**
85 |
86 | The training process doesn't see this sentence just once. It systematically slides a window across it, turning one sentence into a full curriculum.
87 |
88 | | | Input Sequence (What the model sees) | Target Output (What it must predict) |
89 | | :--- | :--- | :--- |
90 | | **Example 1** | ["The"] | `cat` |
91 | | **Example 2** | ["The", "cat"] | `sat` |
92 | | **Example 3** | ["The", "cat", "sat"] | `on` |
93 | | **Example 4** | ["The", "cat", "sat", "on"] | `the` |
94 | | **Example 5** | ["The", "cat", "sat", "on", "the"]| `mat` |
95 |
96 | One sentence just generated five high-quality, perfectly labeled training examples for free.
97 |
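Here is the same algorithm as a few lines of Python (a minimal sketch that splits on whitespace; a real pipeline would use the subword tokens introduced in Part 2):

```python
def create_training_data(tokens):
    """Turn one token list into (input_sequence, target) training pairs."""
    pairs = []
    for k in range(1, len(tokens)):
        pairs.append((tokens[:k], tokens[k]))
    return pairs

sentence = "The cat sat on the mat".split()
for input_sequence, target in create_training_data(sentence):
    print(input_sequence, "->", target)
# ['The'] -> cat
# ['The', 'cat'] -> sat
# ... (5 pairs in total, matching the table above)
```
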
98 | #### Why Does Predicting the Next Word Create Intelligence?
99 |
100 | At first, this seems too simple. How can guessing the next word teach a model to reason, write code, or explain science?
101 |
102 | Because to get *consistently good* at this task across billions of examples, the model is forced to build a deep, internal understanding of the world. Imagine the model is given the following input text and must predict the single next word:
103 |
104 | **"In Paris, the capital of France, the primary language spoken is..."**
105 |
106 | What must the model learn to accurately predict the word `French`?
107 |
108 | 1. **It needs to understand grammar:** It recognizes that the verb "is" will likely be followed by a noun or adjective.
109 | 2. **It needs to handle long-distance context:** It must connect the end of the sentence back to the subject, "Paris," which appeared many words earlier.
110 | 3. **It needs to learn facts about the world:** It must know that Paris is the capital of France, and that the language spoken in France is French.
111 | 4. **It needs to ignore distractors:** It must realize that the word "primary" is less important than "France" for determining the language.
112 |
113 | The only way for the model to minimize its prediction error across trillions of examples like this is to develop a rich internal model of concepts, facts, and the relationships between them. It's not memorizing; it's learning the underlying patterns of reality as reflected in human language.
114 |
115 | #### Connecting to Reality: The Power of Scale
116 |
117 | This self-supervised approach caused a paradigm shift in the scale of AI.
118 |
119 | | Era | Dataset Example | Size | Parameters | Human Labeling? |
120 | | :--- | :--- | :--- | :--- | :--- |
121 | | **Traditional ML** | MNIST Digits | ~60,000 images (Megabytes) | 1-10 Million | **Yes** |
122 | | **Deep Learning** | ImageNet | 14 Million images (Gigabytes) | 25-150 Million | **Yes** |
123 | | **GPT-2 Era** | WebText | 40GB of text (~8M pages) | **1.5 Billion** | **No** |
124 |
125 | The leap is staggering. A single 2,000-word Wikipedia article is automatically converted into **1,999** individual training examples. Scale that across the 40GB of text GPT-2 was trained on, and you have **billions** of learning opportunities, all for free.
126 |
127 | We've solved the data problem by turning the internet into an infinitely large, self-labeling textbook.
128 |
129 | Now that we understand the *task*, let's tackle the next critical step: how do we turn these words into numbers our neural network can actually process? This is where we move to **Tokenization**.
130 |
131 | ## **Part 2: Tokenization - Turning Language into LEGO Bricks**
132 |
133 | We've established our learning task: predict the next piece of text. But our neural network doesn't understand "text"; it understands numbers. The process of converting raw text into a list of numbers the model can process is called **Tokenization**.
134 |
135 | At first glance, this seems simple. Why not just split sentences by spaces? Or go even smaller and use individual characters? Let's explore why these naive approaches fail.
136 |
137 | Consider the sentence: **"The cat quickly jumped."**
138 |
139 | * **Word-level tokenization** would give us: `["The", "cat", "quickly", "jumped."]`
140 | * **Character-level tokenization** would give us: `["T", "h", "e", " ", "c", "a", "t", ...]`
141 |
142 | Both of these simple methods create immediate and severe problems.
143 |
144 | | Problem | Word-Level Issues | Character-Level Issues |
145 | | :--- | :--- | :--- |
146 | | **Massive Vocabulary** | Is "The" different from "the"? Are "jump", "jumps", and "jumping" all unique words? The vocabulary would need to store every single variation, making it enormous. | Solved. The vocabulary is tiny (A-Z, 0-9, punctuation). |
147 | | **Unknown Words** | What happens with a new word like "hyper-threading" or a typo like "awesommmme"? The model has no entry for it. This is a critical failure point known as the **Out-of-Vocabulary (OOV)** problem. | Solved. Any word can be constructed from characters. |
148 | | **Sequence Length** | Sequences are short and manageable. "The cat jumped." is 4 tokens. | **Massive Inefficiency.** A 4-word sentence becomes over 20 tokens. A paragraph becomes thousands. The model must process each character one by one, making learning patterns across long distances slow and difficult. |
149 |
150 | We need a solution that gives us the best of both worlds: a manageable vocabulary that can still represent any word without creating absurdly long sequences. Modern language models solve this with a clever technique called **Subword Tokenization**.
151 |
152 | #### The LEGO Brick Approach: Subword Tokenization
153 |
154 | The core idea is brilliant: **Don't treat words as the smallest unit.** Instead, break them down into smaller, common pieces, just like building things with LEGO bricks. The tokenizer learns these common pieces from the training data itself.
155 |
156 | Let's see how a real subword tokenizer might handle our examples:
157 |
158 | * The common word "cat" is treated as a single token: `["cat"]`
159 | * The word "quickly" is broken into two common pieces: `["quick", "##ly"]`
160 | * The word "jumping" becomes two familiar parts: `["jump", "##ing"]`
161 |
162 | The `##` is a special symbol that simply means "this token is attached to the previous one." This elegant solution solves all of our earlier problems:
163 |
164 | 1. **It Creates an Efficient Vocabulary:** Instead of needing separate entries for `jump`, `jumps`, `jumping`, and `jumper`, the tokenizer only needs to know the common stem `jump` and the common subwords `##s`, `##ing`, and `##er`. This keeps the vocabulary size manageable (GPT-2 uses about 50,000 tokens).
165 | 2. **It Eliminates Unknown Words:** How does it handle a new, complex word like "hyper-threading"? It can build it from its LEGO bricks: `["hyper", "-", "thread", "##ing"]`. What about a typo like "awesommmme"? It might break it down into `["awesome", "##m", "##m", "##e"]`. The model can represent **any** word by breaking it down into a combination of known subwords and, in the worst case, individual characters.
166 |
167 | #### The Final Output: Integer IDs
168 |
169 | After the text is broken into these subword tokens, the tokenizer looks up each token in its vocabulary to get a unique integer ID. These IDs are what actually get fed into our model as the **Input Tokens** in our architecture diagram.
170 |
171 | Let's imagine a small part of a learned vocabulary:
172 |
173 | | Token | Token ID |
174 | | :--- | :---: |
175 | | "The" | 5 |
176 | | "cat" | 8 |
177 | | "quick" | 73 |
178 | | "##ly" | 152 |
179 | | "jump" | 311 |
180 | | "##ed" | 94 |
181 |
182 | The full tokenization process for "The cat quickly jumped" would look like this:
183 |
184 | 1. **Input Text:** "The cat quickly jumped"
185 | 2. **Subword Splitting:** `["The", "cat", "quick", "##ly", "jump", "##ed"]`
186 | 3. **Final Output (Token IDs):** `[5, 8, 73, 152, 311, 94]`
187 |
188 | This final list of numbers is what represents our sentence. But these numbers are just labels. ID `73` doesn't have any mathematical relationship to ID `8`. They are just arbitrary pointers.
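
The vocabulary and `##` pieces above are illustrative. If you want to see the real thing, GPT-2's actual BPE tokenizer is available through the `tiktoken` library (its learned merges differ from our toy example, so the pieces and IDs will look different):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's BPE vocabulary (~50K tokens)

ids = enc.encode("The cat quickly jumped")
print(ids)                                   # a short list of integer token IDs
print([enc.decode([i]) for i in ids])        # the subword piece behind each ID
```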
189 |
190 | ***
191 | *A quick clarification: In Part 1, when we said the model predicts the next "word," it's more precise to say it predicts the next **token**. The process is the same, but the model is often working with these subword pieces, not just full words. This allows for a much more flexible and powerful system.*
192 | ***
193 |
194 | #### The Dirty Secret of Tokenization
195 |
196 | Tokenization is a brilliant engineering compromise, but it can confuse models in surprising ways. For example, ask a powerful LLM how many 'r's are in the word "strawberry" and it might struggle.
197 |
198 | Why?
199 |
200 | A human sees the word "strawberry"—one complete object where you can easily count three 'r's.
201 |
202 | But the model might see three abstract, alien symbols: `[$, %, &]`, where:
203 | * `$` means "str"
204 | * `%` means "aw"
205 | * `&` means "berry"
206 |
207 | Now, imagine someone asks you: "How many 'r's are in `$ % &`?"
208 |
209 | You would have to mentally decode each symbol back to its letters, keep track of where the 'r's are across the symbol boundaries, and then count them. That's exactly what the model has to do. The letters 'r' are hidden inside tokens `$` and `&`, split across the token boundaries. The model doesn't naturally "see" individual characters—it sees these learned chunks.
210 |
211 | This is why language models can write beautiful poetry about strawberries but might stumble when counting the letters in the word. The tokenization that makes them efficient also creates blind spots.
212 |
213 | Now that we've turned our text into a clean sequence of token IDs, we need to convert these meaningless IDs into rich, meaningful vectors that our neural network can actually understand. This is the **Embeddings** layer, the input to the Transformer model itself. Let's now jump past the model's internal workings and see how it produces an output.
214 |
215 | ## **Part 3: The Output - From Probabilities to Actual Words**
216 |
217 | We've turned our text into token IDs and fed them through the massive neural network. Now comes the moment of truth: the model must make a prediction. This is where we see the final, elegant math that drives both learning and creativity.
218 |
219 | #### Critical Distinction: Training vs. Generation
220 |
221 | Here is what 90% of people misunderstand about how these models work. There are two distinct processes:
222 |
223 | 1. **During Training,** the model's goal is to output a **probability distribution** over all 50,257 possible tokens. This distribution is then compared to the single correct answer to calculate an error (loss), which is used to update the model's weights. The goal is to get better.
224 | 2. **During Generation** (when you use ChatGPT or an API), the model still produces this probability distribution, but instead of calculating error, it **samples** from this distribution to pick the next word. The goal is to create new text.
225 |
226 | Let's break down each step, starting with the part they have in common.
227 |
228 | ---
229 |
230 | ### **Section 1: Making a Prediction - The Softmax Function**
231 |
232 | After all the complex internal calculations, the model's final layer produces a raw output score—a **logit**—for every single token in its vocabulary. For GPT-2, with its 50,257-token vocabulary, you get 50,257 logits.
233 |
234 | These are just raw numbers. They aren't probabilities yet.
235 |
236 | * They can be positive or negative.
237 | * They don't sum to 1.
238 |
239 | For example, after seeing "The cat sat on the", the logits might look like this:
240 |
241 | `{"mat": 3.2, "rug": 1.3, "floor": 0.5, "moon": -2.1, ... (and 50,253 other scores)}`
242 |
243 | To turn these messy logits into clean probabilities, we use a crucial function called **Softmax**.
244 |
245 | The Softmax formula is:
246 | `probability_of_token_i = exponent(logit_i) / sum_of_all_exponentiated_logits`
247 |
248 | For each token, we raise *e* to the power of its logit, and then divide that by the sum of all the exponentiated logits. This operation guarantees two things:
249 | 1. Every probability will be between 0 and 1.
250 | 2. All the probabilities will sum to exactly 100%.
251 |
252 | Let's see this in action with our simplified example vocabulary:
253 |
254 | | Token | Step 1: Logit Score | Step 2: Exponentiate (e^logit) | Step 3: Divide by Sum (28.32) | Final Probability |
255 | | :--- | :---: | :---: | :---: | :---: |
256 | | "mat" | **3.2** | 24.53 | 24.53 / 28.32 | **86.6%** |
257 | | "rug" | **1.3** | 3.67 | 3.67 / 28.32 | **13.0%** |
258 | | "moon"| **-2.1**| 0.12 | 0.12 / 28.32 | **0.4%** |
259 | | **Total**| | **28.32** | | **100%** |
260 |
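The same calculation, reproduced in a few lines (a sketch using only our three example logits; a real model has 50,257 of them):

```python
import math

logits = {"mat": 3.2, "rug": 1.3, "moon": -2.1}

exponentiated = {tok: math.exp(z) for tok, z in logits.items()}
total = sum(exponentiated.values())                      # ≈ 28.32
probabilities = {tok: e / total for tok, e in exponentiated.items()}

print(probabilities)   # {'mat': ~0.866, 'rug': ~0.130, 'moon': ~0.004}
```
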
261 | The model's official prediction is now a clean probability distribution. We've reached the **Output Probabilities** stage in our diagram. Now, what we do with this depends on whether we are training or generating.
262 |
263 | ---
264 |
265 | ### **Section 2: The Learning Process (Training Only)**
266 |
267 | During training, we know the correct answer was "mat". Our model assigned an 86.6% probability to it. Was that good? How do we turn this into a single error number to drive learning?
268 |
269 | We use a loss function called **Cross-Entropy**.
270 |
271 | The intuition behind Cross-Entropy is to measure **"surprise."**
272 | * If you predict the correct answer with high confidence, you are not surprised (low loss).
273 | * If you predict the correct answer with low confidence, you are very surprised (high loss).
274 |
275 | The full Cross-Entropy formula looks complex, but for next-token prediction, it simplifies beautifully to:
276 |
277 | **`Loss = -log(probability_of_the_correct_token)`**
278 |
279 | That's it. We only care about the probability the model assigned to the single right answer. All other probabilities are ignored because their "true" probability was 0.
280 |
281 | Let's calculate it for our two scenarios:
282 |
283 | 1. **A Good Prediction:**
284 | * The model assigned **86.6%** to the correct token ("mat").
285 | * Loss = -log(0.866) ≈ **0.14**
286 | * This is a small number, which is good! It tells the network it did a good job.
287 |
288 | 2. **A Terrible Prediction:**
289 | * Imagine the model had only assigned **1%** to "mat".
290 | * Loss = -log(0.01) ≈ **4.6**
291 | * This is a much larger number, reflecting high surprise and creating a large error signal to drive learning.
292 |
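The same two scenarios, in code (a sketch of the simplified formula; in practice PyTorch's `F.cross_entropy` computes this directly from the logits):

```python
import math

def cross_entropy_loss(prob_of_correct_token):
    return -math.log(prob_of_correct_token)

print(cross_entropy_loss(0.866))   # ≈ 0.14 -> low surprise, small error signal
print(cross_entropy_loss(0.01))    # ≈ 4.61 -> high surprise, large error signal
```
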
293 | This single loss number is the starting point for backpropagation. When combined with Softmax, the initial gradient (the direction for correction) simplifies to `Predicted_Probability - True_Probability`. This clean, simple error signal flows backward through the entire network, updating billions of weights to make a slightly better prediction next time.
294 |
295 | ---
296 |
297 | ### **Section 3: Creating Text (Generation Only)**
298 |
299 | When you actually use ChatGPT or call the OpenAI API, you're not training—you're generating text. The model produces the same probabilities, but instead of calculating loss, it has to pick a single word.
300 |
301 | Simply picking the word with the highest probability every time would lead to deterministic, boring, and repetitive text. To create interesting and creative output, the model **samples** from the probability distribution, and we can control this sampling with parameters.
302 |
303 | Here is a typical API call:
304 | ```python
305 | import openai
306 |
307 | response = openai.ChatCompletion.create(
308 | model="gpt-4",
309 | messages=[{"role": "user", "content": "The weather today is"}],
310 | temperature=0.7, # Controls randomness
311 | top_p=0.9, # Controls diversity
312 | )
313 | ```
314 |
315 | What do these parameters actually do?
316 |
317 | **Temperature (Range: 0 to 2)**
318 | * **Intuition:** Controls the "creativity" or randomness of the output.
319 | * **Mechanism:** It modifies the logits *before* the Softmax function: `adjusted_logits = logits / temperature` (see the sketch after this list).
320 |    * `temperature < 1.0` (e.g., 0.5): Dividing by a number smaller than 1 widens the gaps between the logits. This *sharpens* the probabilities, making the model more confident and deterministic. Good for factual answers.
321 |    * `temperature > 1.0` (e.g., 1.5): Dividing by a number larger than 1 shrinks the gaps between the logits. This *flattens* the probabilities, increasing the chance of picking a less likely, more "creative" word.
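
Here is a minimal sketch of that mechanism, using three made-up logits; the only point is the division by `temperature` before Softmax:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply Softmax."""
    adjusted = [logit / temperature for logit in logits]
    exp_scores = [math.exp(a) for a in adjusted]
    total = sum(exp_scores)
    return [score / total for score in exp_scores]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
print(softmax_with_temperature(logits, 1.0))  # baseline: ~0.67, 0.24, 0.09
print(softmax_with_temperature(logits, 0.5))  # sharper:  ~0.87, 0.12, 0.02
print(softmax_with_temperature(logits, 2.0))  # flatter:  ~0.51, 0.31, 0.19
```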
322 |
323 | **Top-p (Nucleus Sampling) (Range: 0 to 1)**
324 | * **Intuition:** Controls the diversity of the output by preventing the model from picking truly nonsensical words.
325 | * **Mechanism:** Instead of considering all 50,257 tokens, it samples from the smallest possible set of tokens whose cumulative probability exceeds the `top_p` value.
326 | * `top_p = 0.1`: Only sample from the most likely tokens that make up the top 10% of the probability mass. This is very focused and safe.
327 | * `top_p = 0.9`: Sample from a much wider "nucleus" of plausible tokens. This allows for more diversity without considering the garbage tokens in the long tail of the distribution.
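
Here is a minimal sketch of nucleus sampling, using a small made-up distribution (the same weather tokens as the table below):

```python
import random

def top_p_filter(probs: dict, top_p: float) -> dict:
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    nucleus, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(nucleus.values())          # renormalize the survivors
    return {token: p / total for token, p in nucleus.items()}

probs = {"sunny": 0.40, "cloudy": 0.30, "rainy": 0.20, "beautiful": 0.10}
nucleus = top_p_filter(probs, top_p=0.9)   # drops "beautiful", the long tail
token = random.choices(list(nucleus), weights=list(nucleus.values()))[0]
print(nucleus, token)
```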
328 |
329 | Let's see how temperature changes the output for the prompt "The weather today is". (Because the base probabilities come from Softmax, dividing the logits by `T` is equivalent to raising each base probability to the power `1/T` and renormalizing; that is how the columns below are computed.)
330 |
331 | | Token | Base Prob. (temp=1.0) | Prob. at temp=0.5 (Sharper) | Prob. at temp=2.0 (Flatter) |
332 | | :--- | :---: | :---: | :---: |
333 | | `sunny` | 40% | **~53%** | ~33% |
334 | | `cloudy`| 30% | ~30% | ~28% |
335 | | `rainy` | 20% | ~13% | ~23% |
336 | | `beautiful`| 10% | ~3% | ~16% |
337 |
338 | As you can see, lowering the temperature widens the lead of "sunny" (and as the temperature approaches 0, picking it becomes a near-certainty). Raising it makes the probabilities more even, giving a creative word like "beautiful" a real chance to be selected.
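
A small sketch that reproduces the two temperature columns from the base probabilities, using the `p^(1/T)` rescaling mentioned above:

```python
def apply_temperature(probs: dict, temperature: float) -> dict:
    """Raise each base probability to the power 1/T, then renormalize."""
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(scaled.values())
    return {token: p / total for token, p in scaled.items()}

base = {"sunny": 0.40, "cloudy": 0.30, "rainy": 0.20, "beautiful": 0.10}
print(apply_temperature(base, 0.5))  # sunny ~0.53, cloudy ~0.30, rainy ~0.13, beautiful ~0.03
print(apply_temperature(base, 2.0))  # sunny ~0.33, cloudy ~0.28, rainy ~0.23, beautiful ~0.16
```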
339 |
340 | And that is the complete output pipeline: from raw logits to the probabilities that drive both self-correction during training and creative text generation during inference.
341 |
342 | ## **Conclusion: From Raw Knowledge to a Useful Assistant**
343 |
344 | You have now mastered the fundamentals of **pre-training**. You've journeyed through the entire pipeline, from a raw text file on the internet to a sophisticated model capable of predicting the next token with remarkable accuracy.
345 |
346 | Let's recap the core concepts:
347 |
348 | 1. **The Self-Supervised Engine:** We started by solving the biggest problem in AI: the need for expensive, human-labeled data. By framing the task as simple **next-token prediction**, we turned the vast, unlabeled text of the internet into an infinite, free source of training examples.
349 | 2. **The Language-to-Number Bridge:** We saw how **subword tokenization** acts like a set of LEGO bricks, efficiently breaking down any word into manageable pieces that the model can process, represented as a simple list of integer IDs.
350 | 3. **The Learning and Generation Loop:** Finally, we deconstructed the model's output. We learned how the **Softmax** function creates clean probabilities from raw logits, how **Cross-Entropy Loss** uses those probabilities to calculate a "surprise" score that drives learning, and how sampling parameters like **Temperature** and **Top-p** use the very same probabilities to generate creative and coherent text.
351 |
352 | You started this journey seeing language models as magic. But now you know the truth. It's a cascade of brilliant, interconnected ideas, all powered by the simple, elegant process of learning from mistakes on an astronomical scale.
353 |
354 | But our journey isn't over. This pre-trained model is a raw engine of knowledge, not a helpful assistant. This leads to the two crucial questions of what comes next.
355 |
356 | #### What's Next 1: Opening the Black Box - The Transformer
357 |
358 | Throughout this tutorial, we've treated the core of the network—the part that turns input tokens into output logits—as a "black box."
359 |
360 | * *How* does the model actually remember the word "Paris" from ten tokens ago to correctly predict "French"?
361 | * *How* does it weigh the importance of different words in a sentence to understand the true context?
362 |
363 | In the next tutorial, we will finally open that box and explore the revolutionary **Transformer architecture** and its core mechanism: **Self-Attention**. This is the engine that truly understands context, and it's the final piece of the architectural puzzle.
364 |
365 | #### What's Next 2: From Predictor to Assistant - Post-Training
366 |
367 | A next-token predictor is not a chatbot. If you give our pre-trained model the prompt, "What is the capital of France?", its training on internet text might lead it to complete the sentence with another common question, like "...and what is its population?". It's completing a pattern, not answering a question.
368 |
369 | How do we turn this powerful but raw model into a helpful assistant that can follow instructions, answer questions, and refuse to perform harmful tasks?
370 |
371 | That requires a second, crucial stage called **Post-Training**. This involves techniques like:
372 | * **Supervised Fine-Tuning (SFT):** Training the model on a smaller, high-quality dataset of human-written instructions and their ideal responses.
373 | * **Reinforcement Learning from Human Feedback (RLHF):** Allowing humans to rank the model's different answers, teaching it what "helpfulness" and "safety" actually mean through trial and error.
374 |
375 | This is the process that aligns the model with human values—the topic for a future tutorial.
376 |
377 | You now have a solid foundation in how these incredible models are built. The magic has been replaced by understanding, and you're ready to explore the deeper layers of modern artificial intelligence.
--------------------------------------------------------------------------------