# Stochastic Gradient Descent From Scratch

This notebook illustrates the nature of Stochastic Gradient Descent (SGD) and walks through all the steps needed to build SGD from scratch in Python. Gradient descent is an essential part of many machine learning algorithms, including neural networks. Understanding how it works requires some basic math and logical thinking; a stronger math background helps with the derivatives, but I will try to explain them as simply as possible.

We will work with the California housing dataset and fit a linear regression to predict house prices based on the median income in the block. We will start with simple linear regression and gradually work our way up to Stochastic Gradient Descent. So let's get started.

## Importing Libraries


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
```

## California Housing Dataset

Scikit-learn comes with a wide variety of datasets for regression, classification and other problems. Let's load the data into a pandas DataFrame and take a look.

```python
housing_data = fetch_california_housing()
```


```python
Features = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
Target = pd.DataFrame(housing_data.target, columns=['Target'])
```


```python
df = Features.join(Target)
```

Features such as `MedInc`, as well as the `Target`, have already been scaled to some degree.


```python
df.corr()
```

|            |    MedInc |  HouseAge |  AveRooms | AveBedrms | Population |  AveOccup |  Latitude | Longitude |    Target |
|------------|----------:|----------:|----------:|----------:|-----------:|----------:|----------:|----------:|----------:|
| MedInc     |  1.000000 | -0.119034 |  0.326895 | -0.062040 |   0.004834 |  0.018766 | -0.079809 | -0.015176 |  0.688075 |
| HouseAge   | -0.119034 |  1.000000 | -0.153277 | -0.077747 |  -0.296244 |  0.013191 |  0.011173 | -0.108197 |  0.105623 |
| AveRooms   |  0.326895 | -0.153277 |  1.000000 |  0.847621 |  -0.072213 | -0.004852 |  0.106389 | -0.027540 |  0.151948 |
| AveBedrms  | -0.062040 | -0.077747 |  0.847621 |  1.000000 |  -0.066197 | -0.006181 |  0.069721 |  0.013344 | -0.046701 |
| Population |  0.004834 | -0.296244 | -0.072213 | -0.066197 |   1.000000 |  0.069863 | -0.108785 |  0.099773 | -0.024650 |
| AveOccup   |  0.018766 |  0.013191 | -0.004852 | -0.006181 |   0.069863 |  1.000000 |  0.002366 |  0.002476 | -0.023737 |
| Latitude   | -0.079809 |  0.011173 |  0.106389 |  0.069721 |  -0.108785 |  0.002366 |  1.000000 | -0.924664 | -0.144160 |
| Longitude  | -0.015176 | -0.108197 | -0.027540 |  0.013344 |   0.099773 |  0.002476 | -0.924664 |  1.000000 | -0.045967 |
| Target     |  0.688075 |  0.105623 |  0.151948 | -0.046701 |  -0.024650 | -0.023737 | -0.144160 | -0.045967 |  1.000000 |
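
The full matrix is a lot to read at once. One quick way to see which feature is most useful for a single-feature model is to rank the absolute correlations with `Target` (this snippet is an addition, not part of the original notebook):

```python
# Rank features by the absolute strength of their correlation with the target.
df.corr()['Target'].drop('Target').abs().sort_values(ascending=False)
```

`MedInc` comes out on top (about 0.69), which is why the rest of the notebook builds a single-feature model on it.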

## Preprocessing: Removing Outliers and Scaling


```python
df[['MedInc', 'Target']].describe()[1:]  # .style.highlight_max(axis=0)
```

|      |   MedInc |   Target |
|------|---------:|---------:|
| mean | 3.482030 | 1.722805 |
| std  | 1.364922 | 0.749957 |
| min  | 0.499900 | 0.149990 |
| 25%  | 2.452025 | 1.119000 |
| 50%  | 3.303600 | 1.635000 |
| 75%  | 4.346050 | 2.256000 |
| max  | 7.988700 | 3.499000 |

It seems that `Target` has some outliers (and so does `MedInc`): 75% of the houses are priced below roughly 2.65, but the maximum price goes as high as 5. We are going to remove the most expensive houses, as they would add unnecessary noise to the data.


```python
df = df[df.Target < 3.5]
df = df[df.MedInc < 8]
```

### Removed Outliers


```python
df[['MedInc', 'Target']].describe()[1:]
```

|      |   MedInc |   Target |
|------|---------:|---------:|
| mean | 3.482030 | 1.722805 |
| std  | 1.364922 | 0.749957 |
| min  | 0.499900 | 0.149990 |
| 25%  | 2.452025 | 1.119000 |
| 50%  | 3.303600 | 1.635000 |
| 75%  | 4.346050 | 2.256000 |
| max  | 7.988700 | 3.499000 |
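
As a quick sanity check (an addition to the original notebook), we can count how many rows the two filters actually removed:

```python
# Compare the filtered frame with the original feature matrix.
# The exact counts depend on the dataset version, so treat the output as illustrative.
print("rows before filtering:", len(Features))
print("rows after filtering: ", len(df))
print("removed as outliers:  ", len(Features) - len(df))
```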

We will also scale the `MedInc` and `Target` variables to the [0, 1] range.


```python
def scale(x):
    x_min = x.min()
    x_max = x.max()
    return pd.Series([(i - x_min) / (x_max - x_min) for i in x])

X = scale(df.MedInc)
y = scale(df.Target)
```


```python
X.max(), y.max() # features are scaled now
```




    (1.0, 1.0)



## Correlation Between Price and Income

A quick look at the scatter plot gives a visual sense of how accurate we can expect our models to be.


```python
plt.figure(figsize=(16,6))
plt.rcParams['figure.dpi'] = 227
plt.style.use('seaborn-whitegrid')
plt.scatter(X, y, label='Data', c='#388fd8', s=6)
plt.title('Positive Correlation Between Income and House Price', fontsize=15)
plt.xlabel('Income', fontsize=12)
plt.ylabel('House Price', fontsize=12)
plt.legend(frameon=True, loc=1, fontsize=10, borderpad=.6)
plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
plt.show()
```


![png](images/Gradient%20Descent_22_0.png)


The data is quite scattered, but we can still observe some linearity.

# Simple Linear Regression

Simple linear regression is described by only two parameters: the slope `m` and the intercept `b`, where `x` is our **median income**. Let's take a look at the formulas below:

# $$\hat{y} = mx + b$$

### $$m = \frac{\overline{x}\,\overline{y}-\overline{xy}}{(\overline{x})^2 - \overline{x^2}} \quad \textrm{and} \quad b = \overline{y}-m\overline{x}$$

If we wanted to add another feature, such as the size of the apartment, the formula would become $\hat{y} = m_1x_1 + m_2x_2 + b$, where $m_1$ and $m_2$ are the slopes for the features $x_1$ and $x_2$. In that case we would call it multiple linear regression, and we could no longer use the formulas above.
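
Before wrapping these formulas in a class, here is a tiny sanity check on made-up points that lie exactly on the line $y = 2x + 1$ (an illustrative addition, not part of the original notebook):

```python
# Toy check of the closed-form formulas: for points exactly on y = 2x + 1
# they should recover m = 2 and b = 1.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2 * x_toy + 1

m_toy = (np.mean(x_toy) * np.mean(y_toy) - np.mean(x_toy * y_toy)) / (np.mean(x_toy)**2 - np.mean(x_toy**2))
b_toy = np.mean(y_toy) - m_toy * np.mean(x_toy)

print(m_toy, b_toy)  # 2.0 1.0
```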

```python
class SimpleLinearRegression:

    def fit(self, X, y):
        self.X = X
        self.y = y
        self.m = ((np.mean(X) * np.mean(y) - np.mean(X*y)) / ((np.mean(X)**2) - np.mean(X**2)))
        self.b = np.mean(y) - self.m * np.mean(X)

    def coeffs(self):
        return self.m, self.b

    def predict(self):
        self.y_pred = self.m * self.X + self.b
        return self.y_pred

    def r_squared(self):
        self.y_mean = np.full((len(self.y)), np.mean(self.y))
        err_reg = sum((self.y - self.y_pred)**2)
        err_y_mean = sum((self.y - self.y_mean)**2)
        return (1 - (err_reg/err_y_mean))
```


```python
def plot_regression(X, y, y_pred, log=None, title="Linear Regression"):

    plt.figure(figsize=(16,6))
    plt.rcParams['figure.dpi'] = 227
    plt.scatter(X, y, label='Data', c='#388fd8', s=6)
    if log is not None:
        for i in range(len(log)):
            plt.plot(X, log[i][0]*X + log[i][1], lw=1, c='#caa727', alpha=0.35)
    plt.plot(X, y_pred, c='#ff7702', lw=3, label='Regression')
    plt.title(title, fontsize=14)
    plt.xlabel('Income', fontsize=11)
    plt.ylabel('Price', fontsize=11)
    plt.legend(frameon=True, loc=1, fontsize=10, borderpad=.6)
    plt.tick_params(direction='out', length=6, color='#a0a0a0', width=1, grid_alpha=.6)
    plt.show()
```


```python
X = df.MedInc
y = df.Target
```


```python
lr = SimpleLinearRegression()
```


```python
lr.fit(X, y)
```


```python
y_pred = lr.predict()
```


```python
print("MSE:", mean_squared_error(y, y_pred))
plot_regression(X, y, y_pred, title="Linear Regression")
```

    MSE: 0.34320521502255963



![png](images/Gradient%20Descent_35_1.png)


The result of our model is the regression line. Just by looking at the graph we can tell that the data points spread well above and below the line, so the predictions are only approximate.

## Multiple Linear Regression with Least Squares

Similar to `from sklearn.linear_model import LinearRegression`, we can calculate the coefficients with the least squares method. NumPy can evaluate this formula almost instantly (depending, of course, on the amount of data) and precisely.

## $$ m = (A^TA)^{-1} A^Ty $$

### $$m - \textrm{parameters}, \quad A - \textrm{data}, \quad y - \textrm{target}$$
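
As an optional cross-check (an addition to the original notebook), `np.linalg.lstsq` solves the same least-squares problem and should give nearly identical coefficients to the class below:

```python
# np.linalg.lstsq minimizes ||A m - y||^2 directly and is numerically more
# stable than explicitly inverting A^T A.
A = np.asarray(df.drop('Target', axis=1), dtype=float)
b_vec = np.asarray(df.Target, dtype=float)

m_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b_vec, rcond=None)
print(m_lstsq)  # should closely match the coefficients computed below
```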

```python
X = df.drop('Target', axis=1) # matrix A, or all the features
y = df.Target
```


```python
class MultipleLinearRegression:
    '''
    Multiple Linear Regression with Least Squares
    (note: no intercept column is added to the data)
    '''
    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y)
        self.coeffs = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

    def predict(self, X):
        X = np.array(X)
        result = np.zeros(len(X))
        for i in range(X.shape[1]):
            result += X[:, i] * self.coeffs[i]
        return result

    def get_coeffs(self):
        # renamed from `coeffs` so the method does not clash with the attribute set in fit()
        return self.coeffs
```


```python
mlp = MultipleLinearRegression()
```


```python
mlp.fit(X, y)
```


```python
y_pred = mlp.predict(X)
```


```python
mean_squared_error(y, y_pred)
```




    0.2912984534321039



# Gradient Descent

### Abstract

The idea behind gradient descent is simple: by gradually tuning parameters such as the slope (`m`) and the intercept (`b`) of our regression function `y = mx + b`, we minimize the cost.
By cost we mean some function that tells us how far off our model's predictions are. For regression problems we often use the `mean squared error` (MSE) cost function. If we used gradient descent for a classification problem, we would pick a different cost function.

### $$ MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y_i})^2 \quad \textrm{where} \quad \hat{y_i} = mx_i + b $$

Now we have to figure out how to tweak the parameters `m` and `b` to reduce the MSE.

### Partial Derivatives

We use partial derivatives to find out how each individual parameter affects the MSE; that is where the word _partial_ comes from. In simple words, we take the derivative with respect to `m` and with respect to `b` **separately**. Take a look at the formula below. It is almost exactly the MSE, but this time written as a function $f(m, b)$. That changes essentially nothing, except that now we can plug numbers for `m` and `b` into it and calculate the result.

### $$f(m,b)= \frac{1}{n}\sum_{i=1}^{n}(y_i - (mx_i+b))^2$$

This form of the function is a better representation for the partial derivatives we are about to compute. We can ignore the summation and the $\frac{1}{n}$ in front of it for now and focus only on $(y - (mx + b))^2$.

### Partial Derivative With Respect to `m`

"With respect to `m`" means we differentiate with respect to `m` and treat `b` as a constant. To take the derivative we will use the chain rule.

# $$ [f(g(x))]' = f'(g(x)) \cdot g'(x) \: - \textrm{chain rule}$$

The chain rule applies when one function sits inside another. If you are new to this, you may be surprised that $()^2$ is the outside function and $y-(\boldsymbol{m}x+b)$ sits inside it. The chain rule says we take the derivative of the outside function, keep the inside function unchanged, and then multiply by the derivative of the inside function. Let's write these steps down:

# $$ (y - (mx + b))^2 $$

1. The derivative of $()^2$ is $2()$, just as $x^2$ becomes $2x$
2. We do nothing with $y - (mx + b)$, so it stays the same
3. The derivative of $y - (mx + b)$ with respect to **_m_** is $(0 - (x + 0))$, or $-x$, because **_y_** and **_b_** are constants and become 0, and the derivative of **_mx_** is **_x_**
Multiplying all the parts together, we get $2 \cdot (y - (mx+b)) \cdot (-x)$.
It looks nicer if we move $-x$ to the left: $-2x(y-(mx+b))$. There we have it. The final version of our derivative is the following:

### $$\frac{\partial f}{\partial m} = \frac{1}{n}\sum_{i=1}^{n}-2x_i(y_i - (mx_i+b))$$

Here, $\frac{\partial f}{\partial m}$ means the partial derivative of the function $f$ (mentioned earlier) with respect to $m$. We plug our derivative into the summation and we are done.

### Partial Derivative With Respect to `b`

The same rules apply to the derivative with respect to `b`:

1. $()^2$ becomes $2()$, just as $x^2$ becomes $2x$
2. $y - (mx + b)$ stays the same
3. $y - (mx + b)$ becomes $(0 - (0 + 1))$, or $-1$, because **_y_** and **_mx_** are constants and become 0, and the derivative of **_b_** is 1

Multiplying all the parts together, we get $-2(y-(mx+b))$:

### $$\frac{\partial f}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}-2(y_i - (mx_i+b))$$

### Final Function

A few details we should discuss before jumping into the code:
1. Gradient descent is an iterative process: with each iteration (`epoch`) we slightly reduce the MSE, using the derived functions to update the parameters `m` and `b`
2. Because it is iterative, we must choose how many iterations to run, or make the algorithm stop once it approaches the minimum of the MSE; in other words, when the algorithm is no longer improving the MSE, we know it has reached the minimum
3. Gradient descent has an additional parameter, the learning rate (`lr`), which controls how fast or slow the algorithm moves towards the minimum of the MSE

That's about it. As you can see, gradient descent is, for the most part, just the process of taking derivatives and applying them over and over to minimize a function.
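
To make the update rule concrete before the full implementation, here is a single hand-computed step on three made-up points (an illustrative addition, not from the original notebook):

```python
# One gradient step on a tiny made-up dataset whose true relationship is y = 2x.
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])
m_demo, b_demo, lr_demo = 0.0, 0.0, 0.1

f_demo = y_demo - (m_demo * x_demo + b_demo)      # residuals y_i - (m*x_i + b)
dm = -2 * np.sum(x_demo * f_demo) / len(x_demo)   # partial derivative w.r.t. m
db = -2 * np.sum(f_demo) / len(x_demo)            # partial derivative w.r.t. b

m_demo -= lr_demo * dm
b_demo -= lr_demo * db
print(m_demo, b_demo)  # the slope already jumps from 0 towards the true value of 2
```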

```python
def gradient_descent(X, y, lr=0.05, epoch=10):

    '''
    Gradient Descent for a single feature
    '''

    m, b = 0.2, 0.2 # parameters
    log, mse = [], [] # lists to store learning process
    N = len(X) # number of samples

    for _ in range(epoch):

        f = y - (m*X + b)

        # Updating m and b
        m -= lr * (-2 * X.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)

        log.append((m, b))
        mse.append(mean_squared_error(y, (m*X + b)))

    return m, b, log, mse
```

### Predicting House Price With Gradient Descent


```python
X = df.MedInc
y = df.Target

m, b, log, mse = gradient_descent(X, y, lr=0.01, epoch=100)

y_pred = m*X + b

print("MSE:", mean_squared_error(y, y_pred))
plot_regression(X, y, y_pred, log=log, title="Linear Regression with Gradient Descent")

plt.figure(figsize=(16,3))
plt.rcParams['figure.dpi'] = 227
plt.plot(range(len(mse)), mse)
plt.title('Gradient Descent Optimization', fontsize=14)
plt.xlabel('Epochs')
plt.ylabel('MSE')
plt.show()
```

    MSE: 0.3493097403876614



![png](images/Gradient%20Descent_73_1.png)



![png](images/Gradient%20Descent_73_2.png)


## Stochastic Gradient Descent

Stochastic Gradient Descent works almost the same way as Gradient Descent (also called Batch Gradient Descent), but instead of training on the entire dataset it picks only one random sample to update `m` and `b` at each step, which makes every update much cheaper. In the function below I made it possible to change the sample size (`batch_size`), because sometimes it is better to use more than one sample at a time.


```python
def SGD(X, y, lr=0.05, epoch=10, batch_size=1):

    '''
    Stochastic Gradient Descent for a single feature
    '''

    m, b = 0.5, 0.5 # initial parameters
    log, mse = [], [] # lists to store learning process

    for _ in range(epoch):

        indexes = np.random.randint(0, len(X), batch_size) # random sample

        Xs = np.take(X, indexes)
        ys = np.take(y, indexes)
        N = len(Xs)

        f = ys - (m*Xs + b)

        # Updating parameters m and b
        m -= lr * (-2 * Xs.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)

        log.append((m, b))
        mse.append(mean_squared_error(y, m*X+b))

    return m, b, log, mse
```


```python
m, b, log, mse = SGD(X, y, lr=0.01, epoch=100, batch_size=2)
```


```python
y_pred = m*X + b

print("MSE:", mean_squared_error(y, y_pred))
plot_regression(X, y, y_pred, log=log, title="Linear Regression with SGD")

plt.figure(figsize=(16,3))
plt.rcParams['figure.dpi'] = 227
plt.plot(range(len(mse)), mse)
plt.title('SGD Optimization', fontsize=14)
plt.xlabel('Epochs', fontsize=11)
plt.ylabel('MSE', fontsize=11)
plt.show()
```

    MSE: 0.3462919845446769



![png](images/Gradient%20Descent_78_1.png)



![png](images/Gradient%20Descent_78_2.png)


We can see how the regression line moved up and down while searching for the right parameters, and the MSE curve is not nearly as smooth as with regular gradient descent.
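
As an optional comparison (an addition to the original notebook; exact numbers will differ because of scikit-learn's default regularization and its own random sampling), `sklearn.linear_model.SGDRegressor` fits the same single-feature model:

```python
# Compare our hand-rolled SGD with scikit-learn's SGDRegressor on the same feature.
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(np.asarray(X).reshape(-1, 1), y)

print("sklearn SGDRegressor: m =", sgd_reg.coef_[0], " b =", sgd_reg.intercept_[0])
print("our SGD:              m =", m, " b =", b)
```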

## Speed Test for Gradient Descent vs SGD


```python
X = df.MedInc
y = df.Target
```


```python
# Replicate the data 17 times to get a dataset of roughly 300k samples
X = np.concatenate([X] * 17)
y = np.concatenate([y] * 17)
```


```python
X.shape, y.shape
```




    ((304946,), (304946,))




```python
%timeit SGD(X, y, lr=0.01, epoch=1000, batch_size=1)
```

    1.22 s ± 8.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



```python
%timeit gradient_descent(X, y, lr=0.01, epoch=1000)
```

    2.02 s ± 79.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Note that most of the remaining time in `SGD` is spent computing the full-dataset MSE for logging at every epoch; the parameter update itself only touches `batch_size` samples, so without the logging the gap would be much larger.

## Conclusion

1. In this test, SGD (1000 epochs, batch size 1) ran almost twice as fast as Gradient Descent (also called Batch Gradient Descent)
2. On noisy, scattered data we can increase the batch size so that each update is more reliable and learning converges in fewer epochs; it is not a pure form of SGD any more, but we can call it mini-batch SGD
3. A smaller learning rate makes convergence slower but more stable, and can be adjusted accordingly