├── .gitignore ├── LICENSE ├── README.md ├── data └── .gitkeep ├── graph_computation ├── pagerank.py └── transitive_closure.py ├── machine_learning ├── k-means.py └── logistic_regression.py ├── matrix_computation └── matrix_decomposition.py ├── optimization ├── asgd.py ├── bmuf.py ├── easgd.py ├── hogwild!.py ├── ma.py ├── ssgd.py └── ssgd_pytorch.py ├── pic └── DistributedML-cover.jpeg ├── randomized_algorithm └── monte_carlo.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Plot results 10 | *.png 11 | 12 | # VSCode 13 | .vscode/ 14 | 15 | # OS generated files 16 | .DS_Store 17 | 18 | # Distribution / packaging 19 | .Python 20 | build/ 21 | develop-eggs/ 22 | dist/ 23 | downloads/ 24 | eggs/ 25 | .eggs/ 26 | lib/ 27 | lib64/ 28 | parts/ 29 | sdist/ 30 | var/ 31 | wheels/ 32 | pip-wheel-metadata/ 33 | share/python-wheels/ 34 | *.egg-info/ 35 | .installed.cfg 36 | *.egg 37 | MANIFEST 38 | 39 | # data 40 | data/* 41 | !data/.gitkeep 42 | 43 | # PyInstaller 44 | # Usually these files are written by a python script from a template 45 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 46 | *.manifest 47 | *.spec 48 | 49 | # Installer logs 50 | pip-log.txt 51 | pip-delete-this-directory.txt 52 | 53 | # Unit test / coverage reports 54 | htmlcov/ 55 | .tox/ 56 | .nox/ 57 | .coverage 58 | .coverage.* 59 | .cache 60 | nosetests.xml 61 | coverage.xml 62 | *.cover 63 | *.py,cover 64 | .hypothesis/ 65 | .pytest_cache/ 66 | 67 | # Translations 68 | *.mo 69 | *.pot 70 | 71 | # Django stuff: 72 | *.log 73 | local_settings.py 74 | db.sqlite3 75 | db.sqlite3-journal 76 | 77 | # Flask stuff: 78 | instance/ 79 | .webassets-cache 80 | 81 | # Scrapy stuff: 82 | .scrapy 83 | 84 | # Sphinx documentation 85 | docs/_build/ 86 | 87 | # PyBuilder 88 | target/ 89 | 90 | # Jupyter Notebook 91 | .ipynb_checkpoints 92 | 93 | # IPython 94 | profile_default/ 95 | ipython_config.py 96 | 97 | # pyenv 98 | .python-version 99 | 100 | # pipenv 101 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 102 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 103 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 104 | # install all needed dependencies. 105 | #Pipfile.lock 106 | 107 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 108 | __pypackages__/ 109 | 110 | # Celery stuff 111 | celerybeat-schedule 112 | celerybeat.pid 113 | 114 | # SageMath parsed files 115 | *.sage.py 116 | 117 | # Environments 118 | .env 119 | .venv 120 | env/ 121 | venv/ 122 | ENV/ 123 | env.bak/ 124 | venv.bak/ 125 | 126 | # Spyder project settings 127 | .spyderproject 128 | .spyproject 129 | 130 | # Rope project settings 131 | .ropeproject 132 | 133 | # mkdocs documentation 134 | /site 135 | 136 | # mypy 137 | .mypy_cache/ 138 | .dmypy.json 139 | dmypy.json 140 | 141 | # Pyre type checker 142 | .pyre/ 143 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 HongYu Zhang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 9 |

14 | 15 | # Distributed Machine Learning 16 | 📚 *If a captain's highest goal were to keep his ship safe, he would stay in port forever.* 17 | 18 | [![Open Source Love](https://badges.frapsoft.com/os/v2/open-source.svg?v=103)](https://github.com/orion-orion/Distributed-Algorithm-PySpark)[![](https://img.shields.io/github/license/orion-orion/Distributed-Algorithm-PySpark)](https://github.com/orion-orion/Distributed-Algorithm-PySpark/blob/master/LICENSE)[![](https://img.shields.io/github/stars/orion-orion/Distributed-ML-PySpark?style=social)](https://github.com/orion-orion/Distributed-ML-PySpark) 19 |
20 | [![](https://img.shields.io/github/directory-file-count/orion-orion/Distributed-ML-PySpark)](https://github.com/orion-orion/Distributed-ML-PySpark) [![](https://img.shields.io/github/languages/code-size/orion-orion/Distributed-ML-PySpark)](https://github.com/orion-orion/Distributed-ML-PySpark) 21 |
22 | 23 | ## 1 Introduction 24 | This project provides PySpark/PyTorch implementations of classic distributed machine learning algorithms, mainly based on Tie-Yan Liu's book *Distributed Machine Learning* and the course [CME 323: Distributed Algorithms and Optimization](https://stanford.edu/~rezab/classes/cme323/S17/). The main contents cover graph/matrix computation, randomized algorithms, optimization, and machine learning. 25 | 26 | ## 2 Dependencies 27 | 28 | Run the following command to install the dependencies: 29 | ``` 30 | pip install -r requirements.txt 31 | ``` 32 | 33 | Note that my Python version is 3.8.13 and my Java version is 11.0.15. PySpark runs on the Java Virtual Machine and only supports Java 8/11, so do not use a newer Java version; I use Java 11 here. Run `java -version` to check your local Java version. 34 | ```shell 35 | (base) ➜  ~ java -version 36 | java version "11.0.15" 2022-04-19 LTS 37 | Java(TM) SE Runtime Environment 18.9 (build 11.0.15+8-LTS-149) 38 | Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.15+8-LTS-149, mixed mode) 39 | ``` 40 | Finally, PyTorch's `torch.distributed.rpc` module only supports Linux, so make sure you run the related code on a Linux system; otherwise it will raise an error (see [GitHub issues: torch.distributed.rpc](https://github.com/iffiX/machin/issues/17)). 41 | 42 | ## 3 Contents 43 | 44 | - Graph Computation 45 | - PageRank [[explanation]](https://www.cnblogs.com/orion-orion/p/16340839.html) 46 | - Transitive Closure 47 | - Machine Learning 48 | - K-means 49 | - Logistic Regression [[explanation]](https://www.cnblogs.com/orion-orion/p/16318810.html) 50 | - Matrix Computation 51 | - Matrix Decomposition 52 | - Numerical Optimization 53 | - Synchronous Algorithms 54 | - Synchronous Stochastic Gradient Descent (SSGD) [[explanation]](https://www.cnblogs.com/orion-orion/p/16413182.html) [[paper]](https://proceedings.neurips.cc/paper/2010/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf) 55 | - SSGD in PyTorch [[explanation]](https://www.cnblogs.com/orion-orion/p/16413182.html) [[paper]](https://proceedings.neurips.cc/paper/2010/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf) 56 | - Model Average (MA) [[explanation]](https://www.cnblogs.com/orion-orion/p/16426982.html) [[paper]](https://aclanthology.org/N10-1069.pdf) 57 | - Block-wise Model Update Filtering (BMUF) [[explanation]](https://www.cnblogs.com/orion-orion/p/16426982.html) [[paper]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/0005880.pdf) 58 | - Elastic Averaging Stochastic Gradient Descent (EASGD) [[explanation]](https://www.cnblogs.com/orion-orion/p/16426982.html) [[paper]](https://proceedings.neurips.cc/paper/2015/file/d18f655c3fce66ca401d5f38b48c89af-Paper.pdf) 59 | - Asynchronous Algorithms 60 | - Asynchronous Stochastic Gradient Descent (ASGD) [[explanation]](https://www.cnblogs.com/orion-orion/p/17118029.html) [[paper]](https://proceedings.neurips.cc/paper/2011/file/f0e52b27a7a5d6a1a87373dffa53dbe5-Paper.pdf) 61 | - Hogwild! 
[[explanation]](https://www.cnblogs.com/orion-orion/p/17118029.html) [[paper]](https://proceedings.neurips.cc/paper/2011/file/218a0aefd1d1a4be65601cc6ddc1520e-Paper.pdf) 62 | - Randomized Algorithms 63 | - Monte Carlo Method 64 | -------------------------------------------------------------------------------- /data/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/orion-orion/Distributed-ML-PySpark/051790d6bc8d034cfa6af19e7d4f820f4c1fa6d6/data/.gitkeep -------------------------------------------------------------------------------- /graph_computation/pagerank.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Description: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-05-31 14:14:35 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:48:23 8 | ''' 9 | import sys 10 | from operator import add 11 | from typing import Iterable, Tuple 12 | from pyspark.resultiterable import ResultIterable 13 | from pyspark.sql import SparkSession 14 | import os 15 | 16 | os.environ['PYSPARK_PYTHON'] = sys.executable 17 | 18 | n_threads = 4 # Number of local threads 19 | n_iterations = 10 # Number of iterations 20 | q = 0.15  # the default value of q is 0.15 21 | 22 | def computeContribs(neighbors: ResultIterable[int], rank: float) -> Iterable[Tuple[int, float]]: 23 | # Calculate the contribution (rank/num_neighbors) of each vertex and send it to its neighbours. 24 | num_neighbors = len(neighbors) 25 | for vertex in neighbors: 26 | yield (vertex, rank / num_neighbors) 27 | 28 | if __name__ == "__main__": 29 | # Initialize the spark context. 30 | spark = SparkSession\ 31 | .builder\ 32 | .appName("PageRank")\ 33 | .master("local[%d]" % n_threads)\ 34 | .getOrCreate() 35 | 36 | # link: (source_id, dest_id) 37 | links = spark.sparkContext.parallelize( 38 | [(1, 2), (1, 3), (2, 3), (3, 1)], 39 | ) 40 | 41 | # Drop duplicate links and convert the links to an adjacency list. 42 | adj_list = links.distinct().groupByKey().cache() 43 | 44 | # Count the number of vertices. 45 | n_vertexes = adj_list.count() 46 | 47 | # Initialize the rank of each vertex; the default is 1.0/n_vertexes. 48 | ranks = adj_list.map(lambda vertex_neighbors: (vertex_neighbors[0], 1.0/n_vertexes)) 49 | 50 | # Iteratively compute and update the vertex ranks with the PageRank algorithm. 51 | for t in range(n_iterations): 52 | # Calculate the contribution (rank/num_neighbors) of each vertex and send it to its neighbours. 53 | contribs = adj_list.join(ranks).flatMap(lambda vertex_neighbors_rank: computeContribs( 54 | vertex_neighbors_rank[1][0], vertex_neighbors_rank[1][1] # type: ignore[arg-type] 55 | )) 56 | 57 | # Re-calculate the rank of each vertex based on the contributions it received. 58 | ranks = contribs.reduceByKey(add).mapValues(lambda rank: q/n_vertexes + (1 - q)*rank) 59 | 60 | # Collect all vertex ranks and dump them to the console. 61 | for (vertex, rank) in ranks.collect(): 62 | print("%s has rank: %s." % (vertex, rank)) 63 | 64 | spark.stop() 65 | 66 | 67 | # 1 has rank: 0.38891305880091237. 68 | # 2 has rank: 0.214416470596171. 69 | # 3 has rank: 0.3966704706029163. 
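# Note: each round above implements the damped PageRank update
#     rank(v) = q / n_vertexes + (1 - q) * sum_{u -> v} rank(u) / out_degree(u),
# where out_degree(u) is num_neighbors in computeContribs. Since every vertex in this
# example graph receives contributions, the resulting ranks form a probability
# distribution: the three values above sum to 1 (0.3889 + 0.2144 + 0.3967 ≈ 1).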
-------------------------------------------------------------------------------- /graph_computation/transitive_closure.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-07-01 22:04:00 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:51:50 8 | ''' 9 | from pyspark.sql import SparkSession 10 | import sys 11 | import os 12 | 13 | os.environ['PYSPARK_PYTHON'] = sys.executable 14 | 15 | n_threads = 4 # Number of local threads 16 | 17 | if __name__ == "__main__": 18 | spark = SparkSession\ 19 | .builder\ 20 | .appName("Transitive Closure")\ 21 | .master("local[%d]" % n_threads)\ 22 | .getOrCreate() 23 | 24 | paths = spark.sparkContext.parallelize([(1, 2), (1, 3), (2, 3), (3, 1)]).cache() 25 | 26 | # Linear transitive closure: each round grows paths by one edge, 27 | # by joining the the already-discovered paths with graph's edges. 28 | # e.g. join the path (y, z) from the paths with the edge (x, y) from 29 | # the graph to obtain the new path (x, z). 30 | 31 | 32 | # The edges are stored in reversed order because they are about to be joined. 33 | edges = paths.map(lambda x_y: (x_y[1], x_y[0])) 34 | 35 | old_cnt = 0 36 | next_cnt = paths.count() 37 | while True: 38 | old_cnt = next_cnt 39 | # Perform the join, obtaining an RDD of (y, (z, x)) pairs, 40 | # then map the result to obtain the new (x, z) paths. 41 | new_paths = paths.join(edges).map(lambda vertexes: (vertexes[1][1], vertexes[1][0])) 42 | # union new paths 43 | paths = paths.union(new_paths).distinct().cache() 44 | next_cnt = paths.count() 45 | if next_cnt == old_cnt: 46 | break 47 | 48 | print("The original graph has %i paths" % paths.count()) 49 | 50 | spark.stop() -------------------------------------------------------------------------------- /machine_learning/k-means.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-06-30 21:53:37 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:52:42 8 | ''' 9 | import random 10 | from typing import List, Tuple 11 | import numpy as np 12 | from pyspark.sql import SparkSession 13 | import matplotlib.pyplot as plt 14 | import sys 15 | import os 16 | 17 | os.environ['PYSPARK_PYTHON'] = sys.executable 18 | 19 | k = 2 20 | convergeDist = 0.1 21 | n_threads = 4 # Number of local threads 22 | n_iterations = 5 23 | 24 | def closest_center(p: np.ndarray, centers: List[np.ndarray]) -> int: 25 | closest_cid = 0 26 | min_dist = float("+inf") 27 | for cid in range(len(centers)): 28 | dist = np.sqrt(np.sum((p - centers[cid]) ** 2)) 29 | if dist < min_dist: 30 | min_dist = dist 31 | closest_cid = cid 32 | return closest_cid 33 | 34 | def display_clusters(center_to_point: List[Tuple]): 35 | clusters = dict([ (c_id, []) for c_id in range(k)]) 36 | for c_id, (p, _) in center_to_point: 37 | clusters[c_id].append(p) 38 | 39 | for c_id, points in clusters.items(): 40 | points = np.array(points) 41 | color = "#"+''.join([random.choice('0123456789ABCDEF') for i in range(6)]) 42 | plt.scatter(points[:, 0], points[:, 1], c=color) 43 | 44 | plt.savefig("kmeans_clusters_display.png") 45 | 46 | 47 | if __name__ == "__main__": 48 | spark = SparkSession\ 49 | .builder\ 50 | .appName("K-means")\ 51 | .master("local[%d]" % n_threads)\ 52 | .getOrCreate() 53 | 54 | matrix = np.array([[1, 2], [1, 4], [1, 0], 55 | [10, 2], [10, 4], [10, 0]]) 56 | points = 
spark.sparkContext.parallelize(matrix).cache() 57 | 58 | k_centers = points.takeSample(False, k, 42) 59 | 60 | for t in range(n_iterations): 61 | # assign each point to the center closest to it. 62 | center_to_point = points.map( 63 | lambda p: (closest_center(p, k_centers), (p, 1))) 64 | 65 | # for each cluster(points shareing the some center), 66 | # compute the sum of vecters in it and the size of it. 67 | cluster_stats = center_to_point.reduceByKey( 68 | lambda p1_cnt1, p2_cnt2: (p1_cnt1[0] + p2_cnt2[0], p1_cnt1[1] + p2_cnt2[1])) 69 | 70 | # for each cluster, compute the mean vecter. 71 | mean_vecters = cluster_stats.map( 72 | lambda stat: (stat[0], stat[1][0] / stat[1][1])).collect() 73 | 74 | # update the centers. 75 | for (c_id, mean_vecter) in mean_vecters: 76 | k_centers[c_id] = mean_vecter 77 | 78 | print("Final centers: " + str(k_centers)) 79 | 80 | if matrix.shape[1] == 2: 81 | display_clusters(center_to_point.collect()) 82 | 83 | spark.stop() -------------------------------------------------------------------------------- /machine_learning/logistic_regression.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-05-26 21:02:38 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-01 16:22:53 8 | ''' 9 | from sklearn.datasets import load_breast_cancer 10 | import numpy as np 11 | from pyspark.sql import SparkSession 12 | from operator import add 13 | from sklearn.model_selection import train_test_split 14 | from sklearn.metrics import accuracy_score 15 | import matplotlib.pyplot as plt 16 | import sys 17 | import os 18 | 19 | os.environ['PYSPARK_PYTHON'] = sys.executable 20 | 21 | n_threads = 4 # Number of local threads 22 | n_iterations = 1500 # Number of iterations 23 | eta = 0.1 # iteration step_size 24 | 25 | def logistic_f(x, w): 26 | return 1 / (np.exp(-x.dot(w)) + 1) 27 | 28 | 29 | def gradient(point: np.ndarray, w: np.ndarray) -> np.ndarray: 30 | """ Compute linear regression gradient for a matrix of data points 31 | """ 32 | y = point[-1] # point label 33 | x = point[:-1] # point coordinate 34 | # For each point (x, y), compute gradient function, then sum these up 35 | return - (y - logistic_f(x, w)) * x 36 | 37 | def draw_acc_plot(accs, n_iterations): 38 | def ewma_smooth(accs, alpha=0.9): 39 | s_accs = np.zeros(n_iterations) 40 | for idx, acc in enumerate(accs): 41 | if idx == 0: 42 | s_accs[idx] = acc 43 | else: 44 | s_accs[idx] = alpha * s_accs[idx-1] + (1 - alpha) * acc 45 | return s_accs 46 | 47 | s_accs = ewma_smooth(accs, alpha=0.9) 48 | plt.plot(np.arange(1, n_iterations + 1), accs, color="C0", alpha=0.3) 49 | plt.plot(np.arange(1, n_iterations + 1), s_accs, color="C0") 50 | plt.title(label="Accuracy on test dataset") 51 | plt.xlabel("Round") 52 | plt.ylabel("Accuracy") 53 | plt.savefig("logistic_regression_acc_plot.png") 54 | 55 | 56 | if __name__ == "__main__": 57 | 58 | X, y = load_breast_cancer(return_X_y=True) 59 | 60 | D = X.shape[1] 61 | X_train, X_test, y_train, y_test = train_test_split( 62 | X, y, test_size=0.3, random_state=0) 63 | n_train, n_test = X_train.shape[0], X_test.shape[0] 64 | 65 | spark = SparkSession\ 66 | .builder\ 67 | .appName("Logistic Regression")\ 68 | .master("local[%d]" % n_threads)\ 69 | .getOrCreate() 70 | 71 | matrix = np.concatenate( 72 | [X_train, np.ones((n_train, 1)), y_train.reshape(-1, 1)], axis=1) 73 | 74 | points = spark.sparkContext.parallelize(matrix).cache() 75 | 76 | # Initialize w to a random value 77 
| w = 2 * np.random.ranf(size=D + 1) - 1 78 | print("Initial w: " + str(w)) 79 | 80 | accs = [] 81 | for t in range(n_iterations): 82 | print("On iteration %d" % (t + 1)) 83 | w_br = spark.sparkContext.broadcast(w) 84 | 85 | # g = points.map(lambda point: gradient(point, w)).reduce(add) 86 | # g = points.map(lambda point: gradient(point, w_br.value)).reduce(add) 87 | g = points.map(lambda point: gradient(point, w_br.value))\ 88 | .treeAggregate(0.0, add, add) 89 | 90 | w -= eta * g 91 | 92 | y_pred = logistic_f(np.concatenate( 93 | [X_test, np.ones((n_test, 1))], axis=1), w) 94 | pred_label = np.where(y_pred < 0.5, 0, 1) 95 | acc = accuracy_score(y_test, pred_label) 96 | accs.append(acc) 97 | print("iterations: %d, accuracy: %f" % (t, acc)) 98 | 99 | print("Final w: %s " % w) 100 | print("Final acc: %f" % acc) 101 | 102 | spark.stop() 103 | 104 | draw_acc_plot(accs, n_iterations) 105 | 106 | # Final w: [ 1.16200213e+04 1.30671054e+04 6.53960395e+04 2.13003287e+04 107 | # 8.92852998e+01 -1.09553416e+02 -2.98667851e+02 -1.26433988e+02 108 | # 1.59947852e+02 7.85600857e+01 -3.90622568e+01 8.09490631e+02 109 | # -1.29356637e+03 -4.02060982e+04 4.22124893e+00 -2.30863864e+01 110 | # -4.22144623e+01 -9.06373487e+00 1.16047444e+01 9.14892224e-01 111 | # 1.25920286e+04 1.53120086e+04 6.48615769e+04 -3.23661608e+04 112 | # 1.00625479e+02 -3.98123440e+02 -6.89846039e+02 -1.77214836e+02 113 | # 1.95991193e+02 5.96495248e+01 1.53245784e+03] 114 | # Final acc: 0.941520 115 | 116 | -------------------------------------------------------------------------------- /matrix_computation/matrix_decomposition.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-06-30 19:32:44 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:51:14 8 | ''' 9 | import numpy as np 10 | from pyspark.sql import SparkSession 11 | import sys 12 | import os 13 | 14 | os.environ['PYSPARK_PYTHON'] = sys.executable 15 | 16 | lam = 0.01 # regularization coefficient 17 | m = 100 # number of users 18 | n = 500 # number of items 19 | k = 10 # dim of the latent vectors of users and items 20 | n_iterations = 5 # number of iterations 21 | n_threads = 4 # Number of local threads 22 | 23 | def rmse(R: np.ndarray, U: np.ndarray, V: np.ndarray) -> np.float64: 24 | diff = R - U @ V.T 25 | return np.sqrt(np.sum(np.power(diff, 2)) / (m * n)) 26 | 27 | 28 | def update(i: int, mat: np.ndarray, ratings: np.ndarray) -> np.ndarray: 29 | X_dim = mat.shape[0] 30 | 31 | XtX = mat.T @ mat 32 | Xty = mat.T @ ratings[i, :].T 33 | 34 | for i in range(k): 35 | XtX[i, i] += lam * X_dim 36 | 37 | return np.linalg.solve(XtX, Xty) 38 | 39 | 40 | if __name__ == "__main__": 41 | spark = SparkSession\ 42 | .builder\ 43 | .appName("Matrix Decomposition")\ 44 | .master("local[%d]" % n_threads)\ 45 | .getOrCreate() 46 | 47 | R = np.random.rand(m, k) @ (np.random.rand(n, k).T) 48 | U = np.random.rand(m, k) 49 | V = np.random.rand(n, k) 50 | 51 | R_br = spark.sparkContext.broadcast(R) 52 | U_br = spark.sparkContext.broadcast(U) 53 | V_br = spark.sparkContext.broadcast(V) 54 | 55 | # we use the alternating least squares (ALS) to solve the SVD problem 56 | for t in range(n_iterations): 57 | U_ = spark.sparkContext.parallelize(range(m)) \ 58 | .map(lambda x: update(x, V_br.value, R_br.value)) \ 59 | .collect() 60 | 61 | # collect() returns a list, so we need to convert it to a 2-d array 62 | U = np.array(U_) 63 | U_br = spark.sparkContext.broadcast(U) 64 | 
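# With V fixed, each row of U above solves the regularized normal equations
#     (V^T V + lam * n * I) u_i = V^T R[i, :],
# as assembled in update(). The symmetric step below fixes U and re-solves the
# same kind of system for every row of V, using the transposed rating matrix.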
65 | V_ = spark.sparkContext.parallelize(range(n)) \ 66 | .map(lambda x: update(x, U_br.value, R_br.value.T)) \ 67 | .collect() 68 | V = np.array(V_) 69 | V_br = spark.sparkContext.broadcast(V) 70 | 71 | error = rmse(R, U, V) 72 | print("iterations: %d, rmse: %f" % (t, error)) 73 | 74 | spark.stop() -------------------------------------------------------------------------------- /optimization/asgd.py: -------------------------------------------------------------------------------- 1 | import os 2 | import threading 3 | from datetime import datetime 4 | import torch 5 | import torch.distributed.rpc as rpc 6 | import torch.multiprocessing as mp 7 | import torch.nn as nn 8 | from torch import optim 9 | import torchvision 10 | from torchvision import datasets, transforms 11 | import torch.nn.functional as F 12 | from torch.utils.data import Subset 13 | 14 | 15 | batch_size = 20 16 | n_workers = 5 17 | epochs = 10 18 | seed = 1 19 | log_interval = 10 # how many epochs to wait before logging training status 20 | cuda = True # enables CUDA training 21 | mps = False # enables macOS GPU training 22 | use_cuda = cuda and torch.cuda.is_available() 23 | use_mps = mps and torch.backends.mps.is_available() 24 | if use_cuda: 25 | device = torch.device("cuda") 26 | elif use_mps: 27 | device = torch.device("mps") 28 | else: 29 | device = torch.device("cpu") 30 | 31 | 32 | class CustomSubset(Subset): 33 | '''A custom subset class with customizable data transformation''' 34 | def __init__(self, dataset, indices, subset_transform=None): 35 | super().__init__(dataset, indices) 36 | self.subset_transform = subset_transform 37 | 38 | def __getitem__(self, idx): 39 | x, y = self.dataset[self.indices[idx]] 40 | if self.subset_transform: 41 | x = self.subset_transform(x) 42 | return x, y 43 | 44 | def __len__(self): 45 | return len(self.indices) 46 | 47 | 48 | def dataset_split(dataset, n_workers): 49 | n_samples = len(dataset) 50 | n_sample_per_workers = n_samples // n_workers 51 | local_datasets = [] 52 | for w_id in range(n_workers): 53 | if w_id < n_workers - 1: 54 | local_datasets.append(CustomSubset(dataset, range(w_id * n_sample_per_workers, (w_id + 1) * n_sample_per_workers))) 55 | else: 56 | local_datasets.append(CustomSubset(dataset, range(w_id * n_sample_per_workers, n_samples))) 57 | return local_datasets 58 | 59 | 60 | class Net(nn.Module): 61 | def __init__(self): 62 | super(Net, self).__init__() 63 | self.conv1 = nn.Conv2d(1, 10, kernel_size=5) 64 | self.conv2 = nn.Conv2d(10, 20, kernel_size=5) 65 | self.conv2_drop = nn.Dropout2d() 66 | self.fc1 = nn.Linear(320, 50) 67 | self.fc2 = nn.Linear(50, 10) 68 | 69 | def forward(self, x): 70 | x = F.relu(F.max_pool2d(self.conv1(x), 2)) 71 | x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2)) 72 | x = x.view(-1, 320) 73 | x = F.relu(self.fc1(x)) 74 | x = F.dropout(x, training=self.training) 75 | x = self.fc2(x) 76 | return F.log_softmax(x, dim=1) 77 | 78 | class ParameterServer(object): 79 | 80 | def __init__(self, n_workers=n_workers): 81 | self.model = Net().to(device) 82 | self.lock = threading.Lock() 83 | self.future_model = torch.futures.Future() 84 | self.n_workers = n_workers 85 | self.curr_update_size = 0 86 | self.optimizer = optim.SGD(self.model.parameters(), lr=0.001, momentum=0.9) 87 | for p in self.model.parameters(): 88 | p.grad = torch.zeros_like(p) 89 | self.test_loader = torch.utils.data.DataLoader( 90 | datasets.MNIST('../data', train=False, 91 | transform=transforms.Compose([ 92 | transforms.ToTensor(), 93 | 
transforms.Normalize((0.1307,), (0.3081,)) 94 | ])), 95 | batch_size=32, shuffle=True) 96 | 97 | 98 | def get_model(self): 99 | # TensorPipe RPC backend only supports CPU tensors, 100 | # so we move your tensors to CPU before sending them over RPC 101 | return self.model.to("cpu") 102 | 103 | @staticmethod 104 | @rpc.functions.async_execution 105 | def update_and_fetch_model(ps_rref, grads): 106 | self = ps_rref.local_value() 107 | for p, g in zip(self.model.parameters(), grads): 108 | p.grad += g 109 | with self.lock: 110 | self.curr_update_size += 1 111 | fut = self.future_model 112 | 113 | if self.curr_update_size >= self.n_workers: 114 | for p in self.model.parameters(): 115 | p.grad /= self.n_workers 116 | self.curr_update_size = 0 117 | self.optimizer.step() 118 | self.optimizer.zero_grad() 119 | fut.set_result(self.model) 120 | self.future_model = torch.futures.Future() 121 | 122 | return fut 123 | 124 | def evaluation(self): 125 | self.model.eval() 126 | self.model = self.model.to(device) 127 | test_loss = 0 128 | correct = 0 129 | with torch.no_grad(): 130 | for data, target in self.test_loader: 131 | output = self.model(data.to(device)) 132 | test_loss += F.nll_loss(output, target.to(device), reduction='sum').item() # sum up batch loss 133 | pred = output.max(1)[1] # get the index of the max log-probability 134 | correct += pred.eq(target.to(device)).sum().item() 135 | 136 | test_loss /= len(self.test_loader.dataset) 137 | print('\nTest result - Accuracy: {}/{} ({:.0f}%)\n'.format( 138 | correct, len(self.test_loader.dataset), 100. * correct / len(self.test_loader.dataset))) 139 | 140 | 141 | class Trainer(object): 142 | 143 | def __init__(self, ps_rref): 144 | self.ps_rref = ps_rref 145 | self.model = Net().to(device) 146 | 147 | def train(self, train_dataset): 148 | train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True) 149 | model = self.ps_rref.rpc_sync().get_model().cuda() 150 | pid = os.getpid() 151 | for epoch in range(epochs): 152 | for batch_idx, (data, target) in enumerate(train_loader): 153 | output = model(data.to(device)) 154 | loss = F.nll_loss(output, target.to(device)) 155 | loss.backward() 156 | model = rpc.rpc_sync( 157 | self.ps_rref.owner(), 158 | ParameterServer.update_and_fetch_model, 159 | args=(self.ps_rref, [p.grad for p in model.cpu().parameters()]), 160 | ).cuda() 161 | if batch_idx % log_interval == 0: 162 | print('{}\tTrain Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format( 163 | pid, epoch + 1, batch_idx * len(data), len(train_loader.dataset), 164 | 100. 
* batch_idx / len(train_loader), loss.item())) 165 | 166 | 167 | 168 | def run_trainer(ps_rref, train_dataset): 169 | trainer = Trainer(ps_rref) 170 | trainer.train(train_dataset) 171 | 172 | 173 | def run_ps(trainers): 174 | transform=transforms.Compose([ 175 | transforms.ToTensor(), 176 | transforms.Normalize((0.1307,), (0.3081,)) 177 | ]) 178 | train_dataset = datasets.MNIST('../data', train=True, download=True, 179 | transform=transform) 180 | local_train_datasets = dataset_split(train_dataset, n_workers) 181 | 182 | 183 | print(f"{datetime.now().strftime('%H:%M:%S')} Start training") 184 | ps = ParameterServer() 185 | ps_rref = rpc.RRef(ps) 186 | futs = [] 187 | for idx, trainer in enumerate(trainers): 188 | futs.append( 189 | rpc.rpc_async(trainer, run_trainer, args=(ps_rref, local_train_datasets[idx])) 190 | ) 191 | 192 | torch.futures.wait_all(futs) 193 | print(f"{datetime.now().strftime('%H:%M:%S')} Finish training") 194 | ps.evaluation() 195 | # Test result - Accuracy: 9696/10000 (97%) 196 | 197 | def run(rank, world_size): 198 | os.environ['MASTER_ADDR'] = 'localhost' 199 | os.environ['MASTER_PORT'] = '29500' 200 | options=rpc.TensorPipeRpcBackendOptions( 201 | num_worker_threads=16, 202 | rpc_timeout=0 # infinite timeout 203 | ) 204 | if rank == 0: 205 | rpc.init_rpc( 206 | "ps", 207 | rank=rank, 208 | world_size=world_size, 209 | rpc_backend_options=options 210 | ) 211 | run_ps([f"trainer{r}" for r in range(1, world_size)]) 212 | else: 213 | rpc.init_rpc( 214 | f"trainer{rank}", 215 | rank=rank, 216 | world_size=world_size, 217 | rpc_backend_options=options 218 | ) 219 | # trainer passively waiting for ps to kick off training iterations 220 | 221 | # block until all rpcs finish 222 | rpc.shutdown() 223 | 224 | 225 | if __name__=="__main__": 226 | world_size = n_workers + 1 227 | mp.spawn(run, args=(world_size, ), nprocs=world_size, join=True) 228 | -------------------------------------------------------------------------------- /optimization/bmuf.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-05-26 21:02:38 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:50:46 8 | ''' 9 | from typing import Tuple 10 | from sklearn.datasets import load_breast_cancer 11 | import numpy as np 12 | from pyspark.sql import SparkSession 13 | from operator import add 14 | from sklearn.model_selection import train_test_split 15 | from sklearn.metrics import accuracy_score 16 | import matplotlib.pyplot as plt 17 | import sys 18 | import os 19 | 20 | os.environ['PYSPARK_PYTHON'] = sys.executable 21 | 22 | n_threads = 4 # Number of local threads 23 | n_iterations = 300 # Number of iterations 24 | eta = 0.1 25 | mini_batch_fraction = 0.1 # the fraction of mini batch sample 26 | n_local_iterations = 5 # the number local epochs 27 | mu = 0.9 28 | zeta = 0.1 29 | 30 | def logistic_f(x, w): 31 | return 1 / (np.exp(-x.dot(w)) + 1 +1e-6) 32 | 33 | 34 | def gradient(pt_w: Tuple): 35 | """ Compute linear regression gradient for a matrix of data points 36 | """ 37 | idx, (point, w) = pt_w 38 | y = point[-1] # point label 39 | x = point[:-1] # point coordinate 40 | # For each point (x, y), compute gradient function, then sum these up 41 | return (idx, (w, - (y - logistic_f(x, w)) * x)) 42 | 43 | 44 | def update_local_w(iter): 45 | iter = list(iter) 46 | idx, (w, _) = iter[0] 47 | g_mean = np.mean(np.array([ g for _, (_, g) in iter]), axis=0) 48 | return [(idx, w - eta * g_mean)] 49 | 
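# Note: gradient() and update_local_w() above perform the local SGD steps on each
# partition. After n_local_iterations of them, the driver (in the main loop below)
# averages the worker models into w_avg and applies the BMUF block-momentum update:
#     delta_w = mu * delta_w + zeta * (w_avg - w);  w = w + delta_w.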
50 | 51 | def draw_acc_plot(accs, n_iterations): 52 | def ewma_smooth(accs, alpha=0.9): 53 | s_accs = np.zeros(n_iterations) 54 | for idx, acc in enumerate(accs): 55 | if idx == 0: 56 | s_accs[idx] = acc 57 | else: 58 | s_accs[idx] = alpha * s_accs[idx-1] + (1 - alpha) * acc 59 | return s_accs 60 | 61 | s_accs = ewma_smooth(accs, alpha=0.9) 62 | plt.plot(np.arange(1, n_iterations + 1), accs, color="C0", alpha=0.3) 63 | plt.plot(np.arange(1, n_iterations + 1), s_accs, color="C0") 64 | plt.title(label="Accuracy on test dataset") 65 | plt.xlabel("Round") 66 | plt.ylabel("Accuracy") 67 | plt.savefig("bmuf_acc_plot.png") 68 | 69 | 70 | if __name__ == "__main__": 71 | 72 | X, y = load_breast_cancer(return_X_y=True) 73 | 74 | D = X.shape[1] 75 | 76 | X_train, X_test, y_train, y_test = train_test_split( 77 | X, y, test_size=0.3, random_state=0, shuffle=True) 78 | n_train, n_test = X_train.shape[0], X_test.shape[0] 79 | 80 | spark = SparkSession\ 81 | .builder\ 82 | .appName("BMUF")\ 83 | .master("local[%d]" % n_threads)\ 84 | .getOrCreate() 85 | 86 | matrix = np.concatenate( 87 | [X_train, np.ones((n_train, 1)), y_train.reshape(-1, 1)], axis=1) 88 | 89 | points = spark.sparkContext.parallelize(matrix).cache() 90 | points = points.mapPartitionsWithIndex(lambda idx, iter: [ (idx, arr) for arr in iter]) 91 | 92 | ws = spark.sparkContext.parallelize(2 * np.random.ranf(size=(n_threads, D + 1)) - 1).cache() 93 | ws = ws.mapPartitionsWithIndex(lambda idx, iter: [(idx, next(iter))]) 94 | 95 | w = 2 * np.random.ranf(size=D + 1) - 1 96 | print("Initial w: " + str(w)) 97 | 98 | # weight update 99 | delta_w = 2 * np.random.ranf(size=D + 1) - 1 100 | 101 | accs = [] 102 | for t in range(n_iterations): 103 | print("On iteration %d" % (t + 1)) 104 | w_br = spark.sparkContext.broadcast(w) 105 | ws = ws.mapPartitions(lambda iter: [(iter[0][0], w_br.value)]) 106 | 107 | for local_t in range(n_local_iterations): 108 | ws = points.sample(False, mini_batch_fraction, 42 + t)\ 109 | .join(ws, numPartitions=n_threads)\ 110 | .map(lambda pt_w: gradient(pt_w))\ 111 | .mapPartitions(update_local_w) 112 | 113 | par_w_sum = ws.mapPartitions(lambda iter: [iter[0][1]]).treeAggregate(0.0, add, add) 114 | 115 | w_avg = par_w_sum / n_threads 116 | 117 | delta_w = mu * delta_w + zeta * (w_avg - w) 118 | w = w + delta_w 119 | 120 | y_pred = logistic_f(np.concatenate( 121 | [X_test, np.ones((n_test, 1))], axis=1), w) 122 | pred_label = np.where(y_pred < 0.5, 0, 1) 123 | acc = accuracy_score(y_test, pred_label) 124 | accs.append(acc) 125 | print("iterations: %d, accuracy: %f" % (t, acc)) 126 | 127 | print("Final w: %s " % w) 128 | print("Final acc: %f" % acc) 129 | 130 | spark.stop() 131 | 132 | draw_acc_plot(accs, n_iterations) 133 | 134 | 135 | # Final w: [ 3.41516794e+01 5.11372499e+01 2.04081002e+02 1.03632914e+02 136 | # -7.95309541e+00 6.00459407e+00 -9.58634353e+00 -4.56611790e+00 137 | # -3.12493046e+00 7.20375548e+00 -6.13087884e+00 5.02524913e+00 138 | # -9.99930137e+00 -1.26079312e+02 -7.53719022e+00 -4.93277200e-01 139 | # -9.28534294e+00 -7.81058362e+00 1.78073479e+00 -1.49910377e-01 140 | # 3.93256717e+01 7.52357494e+01 2.09020272e+02 -1.33107647e+02 141 | # 8.22423217e+00 7.29714646e+00 -8.21168535e+00 -4.55323584e-02 142 | # 2.08715673e+00 -9.04949770e+00 -9.35055238e-01] 143 | # Final acc: 0.929825 144 | -------------------------------------------------------------------------------- /optimization/easgd.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | 
Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-05-26 21:02:38 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:50:30 8 | ''' 9 | from typing import Tuple 10 | from sklearn.datasets import load_breast_cancer 11 | import numpy as np 12 | from pyspark.sql import SparkSession 13 | from operator import add 14 | from sklearn.model_selection import train_test_split 15 | from sklearn.metrics import accuracy_score 16 | import matplotlib.pyplot as plt 17 | import sys 18 | import os 19 | 20 | os.environ['PYSPARK_PYTHON'] = sys.executable 21 | 22 | n_threads = 4 # Number of local threads 23 | n_iterations = 1500 # Number of iterations 300 24 | eta = 0.1 25 | mini_batch_fraction = 0.1 # the fraction of mini batch sample 26 | rho = 0.1 # penalty constraint coefficient 27 | alpha = eta * rho # iterative constraint coefficient 28 | beta = n_threads * alpha # the parameter of history information 29 | 30 | def logistic_f(x, w): 31 | return 1 / (np.exp(-x.dot(w)) + 1 +1e-6) 32 | 33 | 34 | def gradient(pt_w: Tuple): 35 | """ Compute linear regression gradient for a matrix of data points 36 | """ 37 | idx, (point, w) = pt_w 38 | y = point[-1] # point label 39 | x = point[:-1] # point coordinate 40 | # For each point (x, y), compute gradient function, then sum these up 41 | return (idx, (w, - (y - logistic_f(x, w)) * x)) 42 | 43 | 44 | def update_local_w(iter, w): 45 | iter = list(iter) 46 | idx, (local_w, _) = iter[0] 47 | g_mean = np.mean(np.array([ g for _, (_, g) in iter]), axis=0) 48 | return [(idx, local_w - eta * g_mean - alpha * (local_w - w))] 49 | 50 | 51 | def draw_acc_plot(accs, n_iterations): 52 | def ewma_smooth(accs, alpha=0.9): 53 | s_accs = np.zeros(n_iterations) 54 | for idx, acc in enumerate(accs): 55 | if idx == 0: 56 | s_accs[idx] = acc 57 | else: 58 | s_accs[idx] = alpha * s_accs[idx-1] + (1 - alpha) * acc 59 | return s_accs 60 | 61 | s_accs = ewma_smooth(accs, alpha=0.9) 62 | plt.plot(np.arange(1, n_iterations + 1), accs, color="C0", alpha=0.3) 63 | plt.plot(np.arange(1, n_iterations + 1), s_accs, color="C0") 64 | plt.title(label="Accuracy on test dataset") 65 | plt.xlabel("Round") 66 | plt.ylabel("Accuracy") 67 | plt.savefig("easgd_acc_plot.png") 68 | 69 | 70 | if __name__ == "__main__": 71 | 72 | X, y = load_breast_cancer(return_X_y=True) 73 | 74 | D = X.shape[1] 75 | 76 | X_train, X_test, y_train, y_test = train_test_split( 77 | X, y, test_size=0.3, random_state=0, shuffle=True) 78 | n_train, n_test = X_train.shape[0], X_test.shape[0] 79 | 80 | spark = SparkSession\ 81 | .builder\ 82 | .appName("EASGD")\ 83 | .master("local[%d]" % n_threads)\ 84 | .getOrCreate() 85 | 86 | matrix = np.concatenate( 87 | [X_train, np.ones((n_train, 1)), y_train.reshape(-1, 1)], axis=1) 88 | 89 | points = spark.sparkContext.parallelize(matrix).cache() 90 | points = points.mapPartitionsWithIndex(lambda idx, iter: [ (idx, arr) for arr in iter]) 91 | 92 | ws = spark.sparkContext.parallelize(2 * np.random.ranf(size=(n_threads, D + 1)) - 1).cache() 93 | ws = ws.mapPartitionsWithIndex(lambda idx, iter: [(idx, next(iter))]) 94 | 95 | w = 2 * np.random.ranf(size=D + 1) - 1 96 | print("Initial w: " + str(w)) 97 | 98 | accs = [] 99 | for t in range(n_iterations): 100 | print("On iteration %d" % (t + 1)) 101 | w_br = spark.sparkContext.broadcast(w) 102 | 103 | ws = points.sample(False, mini_batch_fraction, 42 + t)\ 104 | .join(ws, numPartitions=n_threads)\ 105 | .map(lambda pt_w: gradient(pt_w))\ 106 | .mapPartitions(lambda iter: update_local_w(iter, w=w_br.value)) 107 | 108 | par_w_sum = 
ws.mapPartitions(lambda iter: [iter[0][1]]).treeAggregate(0.0, add, add) 109 | 110 | w = (1 - beta) * w + beta * par_w_sum / n_threads 111 | 112 | y_pred = logistic_f(np.concatenate( 113 | [X_test, np.ones((n_test, 1))], axis=1), w) 114 | pred_label = np.where(y_pred < 0.5, 0, 1) 115 | acc = accuracy_score(y_test, pred_label) 116 | accs.append(acc) 117 | print("iterations: %d, accuracy: %f" % (t, acc)) 118 | 119 | print("Final w: %s " % w) 120 | print("Final acc: %f" % acc) 121 | 122 | spark.stop() 123 | 124 | draw_acc_plot(accs, n_iterations) 125 | 126 | 127 | # Final w: [ 4.41003205e+01 6.87756972e+01 2.59527758e+02 1.43995756e+02 128 | # 1.13597321e-01 -2.85033742e-01 -5.97111145e-01 -2.77260275e-01 129 | # 4.96300761e-01 3.30914106e-01 -2.22883276e-01 4.26915865e+00 130 | # -2.62994199e+00 -1.43839576e+02 -1.78751529e-01 2.54613165e-01 131 | # -8.19158564e-02 4.12327013e-01 -1.13116759e-01 -2.01949538e-01 132 | # 4.56239359e+01 8.74703134e+01 2.62017432e+02 -1.77434224e+02 133 | # 3.78336511e-01 -4.12976475e-01 -1.31121349e+00 -3.16414474e-01 134 | # 9.83796876e-01 2.30045103e-01 5.34560392e+00] 135 | # Final acc: 0.929825 136 | 137 | -------------------------------------------------------------------------------- /optimization/hogwild!.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | import torch.multiprocessing as mp 6 | from torchvision import datasets, transforms 7 | import os 8 | import torch 9 | import torch.optim as optim 10 | import torch.nn.functional as F 11 | 12 | 13 | batch_size = 64 # input batch size for training 14 | test_batch_size = 1000 # input batch size for testing 15 | epochs = 10 # number of global epochs to train 16 | lr = 0.01 # learning rate 17 | momentum = 0.5 # SGD momentum 18 | seed = 1 # random seed 19 | log_interval = 10 # how many batches to wait before logging training status 20 | n_workers = 4 # how many training processes to use 21 | cuda = True # enables CUDA training 22 | mps = False # enables macOS GPU training 23 | dry_run = False # quickly check a single pass 24 | 25 | 26 | def train(rank, model, device, dataset, dataloader_kwargs): 27 | torch.manual_seed(seed + rank) 28 | 29 | train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs) 30 | 31 | optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum) 32 | for epoch in range(1, epochs + 1): 33 | model.train() 34 | pid = os.getpid() 35 | for batch_idx, (data, target) in enumerate(train_loader): 36 | optimizer.zero_grad() 37 | output = model(data.to(device)) 38 | loss = F.nll_loss(output, target.to(device)) 39 | loss.backward() 40 | optimizer.step() 41 | if batch_idx % log_interval == 0: 42 | print('{}\tTrain Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format( 43 | pid, epoch, batch_idx * len(data), len(train_loader.dataset), 44 | 100. 
* batch_idx / len(train_loader), loss.item())) 45 | if dry_run: 46 | break 47 | 48 | 49 | def test(model, device, dataset, dataloader_kwargs): 50 | torch.manual_seed(seed) 51 | test_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs) 52 | 53 | model.eval() 54 | test_loss = 0 55 | correct = 0 56 | with torch.no_grad(): 57 | for data, target in test_loader: 58 | output = model(data.to(device)) 59 | test_loss += F.nll_loss(output, target.to(device), reduction='sum').item() # sum up batch loss 60 | pred = output.max(1)[1] # get the index of the max log-probability 61 | correct += pred.eq(target.to(device)).sum().item() 62 | 63 | test_loss /= len(test_loader.dataset) 64 | print('\nTest set: Global loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format( 65 | test_loss, correct, len(test_loader.dataset), 66 | 100. * correct / len(test_loader.dataset))) 67 | 68 | 69 | class Net(nn.Module): 70 | def __init__(self): 71 | super(Net, self).__init__() 72 | self.conv1 = nn.Conv2d(1, 10, kernel_size=5) 73 | self.conv2 = nn.Conv2d(10, 20, kernel_size=5) 74 | self.conv2_drop = nn.Dropout2d() 75 | self.fc1 = nn.Linear(320, 50) 76 | self.fc2 = nn.Linear(50, 10) 77 | 78 | def forward(self, x): 79 | x = F.relu(F.max_pool2d(self.conv1(x), 2)) 80 | x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2)) 81 | x = x.view(-1, 320) 82 | x = F.relu(self.fc1(x)) 83 | x = F.dropout(x, training=self.training) 84 | x = self.fc2(x) 85 | return F.log_softmax(x, dim=1) 86 | 87 | 88 | if __name__ == '__main__': 89 | use_cuda = cuda and torch.cuda.is_available() 90 | use_mps = mps and torch.backends.mps.is_available() 91 | if use_cuda: 92 | device = torch.device("cuda") 93 | elif use_mps: 94 | device = torch.device("mps") 95 | else: 96 | device = torch.device("cpu") 97 | 98 | print(device) 99 | 100 | transform=transforms.Compose([ 101 | transforms.ToTensor(), 102 | transforms.Normalize((0.1307,), (0.3081,)) 103 | ]) 104 | train_dataset = datasets.MNIST('../data', train=True, download=True, 105 | transform=transform) 106 | test_dataset = datasets.MNIST('../data', train=False, 107 | transform=transform) 108 | kwargs = {'batch_size': batch_size, 109 | 'shuffle': True} 110 | if use_cuda: 111 | kwargs.update({'num_workers': 1, 112 | 'pin_memory': True, 113 | }) 114 | 115 | torch.manual_seed(seed) 116 | mp.set_start_method('spawn', force=True) 117 | 118 | model = Net().to(device) 119 | model.share_memory() # gradients are allocated lazily, so they are not shared here 120 | 121 | processes = [] 122 | for rank in range(n_workers): 123 | p = mp.Process(target=train, args=(rank, model, device, 124 | train_dataset, kwargs)) 125 | # We first train the model across `n_workers` processes 126 | p.start() 127 | processes.append(p) 128 | 129 | for p in processes: 130 | p.join() 131 | 132 | # Once training is complete, we can test the model 133 | test(model, device, test_dataset, kwargs) 134 | # Test set: Global loss: 0.0325, Accuracy: 9898/10000 (99%) -------------------------------------------------------------------------------- /optimization/ma.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-05-26 21:02:38 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-01 16:25:54 8 | ''' 9 | from typing import Tuple 10 | from sklearn.datasets import load_breast_cancer 11 | import numpy as np 12 | from pyspark.sql import SparkSession 13 | from operator import add 14 | from sklearn.model_selection import 
train_test_split 15 | from sklearn.metrics import accuracy_score 16 | import matplotlib.pyplot as plt 17 | import sys 18 | import os 19 | 20 | os.environ['PYSPARK_PYTHON'] = sys.executable 21 | 22 | n_threads = 4 # Number of local threads 23 | n_iterations = 300 # Number of iterations 24 | eta = 0.1 25 | mini_batch_fraction = 0.1 # the fraction of mini batch sample 26 | n_local_iterations = 5 # the number local epochs 27 | 28 | def logistic_f(x, w): 29 | return 1 / (np.exp(-x.dot(w)) + 1 +1e-6) 30 | 31 | 32 | def gradient(pt_w: Tuple): 33 | """ Compute linear regression gradient for a matrix of data points 34 | """ 35 | idx, (point, w) = pt_w 36 | y = point[-1] # point label 37 | x = point[:-1] # point coordinate 38 | # For each point (x, y), compute gradient function, then sum these up 39 | return (idx, (w, - (y - logistic_f(x, w)) * x)) 40 | 41 | 42 | def update_local_w(iter): 43 | iter = list(iter) 44 | idx, (w, _) = iter[0] 45 | g_mean = np.mean(np.array([ g for _, (_, g) in iter]), axis=0) 46 | return [(idx, w - eta * g_mean)] 47 | 48 | 49 | def draw_acc_plot(accs, n_iterations): 50 | def ewma_smooth(accs, alpha=0.9): 51 | s_accs = np.zeros(n_iterations) 52 | for idx, acc in enumerate(accs): 53 | if idx == 0: 54 | s_accs[idx] = acc 55 | else: 56 | s_accs[idx] = alpha * s_accs[idx-1] + (1 - alpha) * acc 57 | return s_accs 58 | 59 | s_accs = ewma_smooth(accs, alpha=0.9) 60 | plt.plot(np.arange(1, n_iterations + 1), accs, color="C0", alpha=0.3) 61 | plt.plot(np.arange(1, n_iterations + 1), s_accs, color="C0") 62 | plt.title(label="Accuracy on test dataset") 63 | plt.xlabel("Round") 64 | plt.ylabel("Accuracy") 65 | plt.savefig("ma_acc_plot.png") 66 | 67 | 68 | if __name__ == "__main__": 69 | 70 | X, y = load_breast_cancer(return_X_y=True) 71 | 72 | D = X.shape[1] 73 | 74 | X_train, X_test, y_train, y_test = train_test_split( 75 | X, y, test_size=0.3, random_state=0, shuffle=True) 76 | n_train, n_test = X_train.shape[0], X_test.shape[0] 77 | 78 | spark = SparkSession\ 79 | .builder\ 80 | .appName("Model Average")\ 81 | .master("local[%d]" % n_threads)\ 82 | .getOrCreate() 83 | 84 | matrix = np.concatenate( 85 | [X_train, np.ones((n_train, 1)), y_train.reshape(-1, 1)], axis=1) 86 | 87 | points = spark.sparkContext.parallelize(matrix).cache() 88 | points = points.mapPartitionsWithIndex(lambda idx, iter: [ (idx, arr) for arr in iter]) 89 | 90 | ws = spark.sparkContext.parallelize(2 * np.random.ranf(size=(n_threads, D + 1)) - 1).cache() 91 | ws = ws.mapPartitionsWithIndex(lambda idx, iter: [(idx, next(iter))]) 92 | 93 | w = 2 * np.random.ranf(size=D + 1) - 1 94 | print("Initial w: " + str(w)) 95 | 96 | accs = [] 97 | for t in range(n_iterations): 98 | print("On iteration %d" % (t + 1)) 99 | w_br = spark.sparkContext.broadcast(w) 100 | ws = ws.mapPartitions(lambda iter: [(iter[0][0], w_br.value)]) 101 | 102 | for local_t in range(n_local_iterations): 103 | ws = points.sample(False, mini_batch_fraction, 42 + t)\ 104 | .join(ws, numPartitions=n_threads)\ 105 | .map(lambda pt_w: gradient(pt_w))\ 106 | .mapPartitions(update_local_w) 107 | 108 | par_w_sum = ws.mapPartitions(lambda iter: [iter[0][1]]).treeAggregate(0.0, add, add) 109 | 110 | w = par_w_sum / n_threads 111 | 112 | y_pred = logistic_f(np.concatenate( 113 | [X_test, np.ones((n_test, 1))], axis=1), w) 114 | pred_label = np.where(y_pred < 0.5, 0, 1) 115 | acc = accuracy_score(y_test, pred_label) 116 | accs.append(acc) 117 | print("iterations: %d, accuracy: %f" % (t, acc)) 118 | 119 | print("Final w: %s " % w) 120 | print("Final acc: %f" % 
acc) 121 | 122 | spark.stop() 123 | 124 | draw_acc_plot(accs, n_iterations) 125 | 126 | 127 | # Final w: [ 3.61341700e+01 5.45002149e+01 2.13992526e+02 1.09001657e+02 128 | # -1.51389834e-03 3.94825208e-01 -9.31372452e-01 -7.19189889e-01 129 | # 3.73256677e-01 4.47409722e-01 2.15583787e-01 3.54025928e+00 130 | # -2.36514711e+00 -1.33926557e+02 -3.50239176e-01 -3.85030823e-01 131 | # 6.86489587e-01 -9.21881175e-01 -5.91052918e-01 -6.89098538e-01 132 | # 3.72997343e+01 6.89626320e+01 2.16316126e+02 -1.45316947e+02 133 | # -5.57393906e-01 -2.76067571e-01 -1.97759353e+00 1.54739454e-01 134 | # 1.26245157e-01 7.73083761e-01 4.00455457e+00] 135 | # Final acc: 0.853801 -------------------------------------------------------------------------------- /optimization/ssgd.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-05-26 21:02:38 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-02 11:49:57 8 | ''' 9 | from sklearn.datasets import load_breast_cancer 10 | import numpy as np 11 | from pyspark.sql import SparkSession 12 | from operator import add 13 | from sklearn.model_selection import train_test_split 14 | from sklearn.metrics import accuracy_score 15 | import matplotlib.pyplot as plt 16 | import sys 17 | import os 18 | 19 | os.environ['PYSPARK_PYTHON'] = sys.executable 20 | 21 | n_threads = 4 # Number of local threads 22 | n_iterations = 1500 # Number of iterations 23 | eta = 0.1 24 | mini_batch_fraction = 0.1 # the fraction of mini batch sample 25 | lam = 0 # coefficient of regular term 26 | 27 | def logistic_f(x, w): 28 | return 1 / (np.exp(-x.dot(w)) + 1) 29 | 30 | 31 | def gradient(point: np.ndarray, w: np.ndarray): 32 | """ Compute linear regression gradient for a matrix of data points 33 | """ 34 | y = point[-1] # point label 35 | x = point[:-1] # point coordinate 36 | # For each point (x, y), compute gradient function, then sum these up 37 | return - (y - logistic_f(x, w)) * x 38 | 39 | 40 | def reg_gradient(w, reg_type="l2", alpha=0): 41 | """ gradient for reg_term 42 | """ 43 | assert(reg_type in ["none", "l2", "l1", "elastic_net"]) 44 | if reg_type == "none": 45 | return 0 46 | elif reg_type == "l2": 47 | return w 48 | elif reg_type == "l1": 49 | return np.sign(w) 50 | else: 51 | return alpha * np.sign(w) + (1 - alpha) * w 52 | 53 | 54 | def draw_acc_plot(accs, n_iterations): 55 | def ewma_smooth(accs, alpha=0.9): 56 | s_accs = np.zeros(n_iterations) 57 | for idx, acc in enumerate(accs): 58 | if idx == 0: 59 | s_accs[idx] = acc 60 | else: 61 | s_accs[idx] = alpha * s_accs[idx-1] + (1 - alpha) * acc 62 | return s_accs 63 | 64 | s_accs = ewma_smooth(accs, alpha=0.9) 65 | plt.plot(np.arange(1, n_iterations + 1), accs, color="C0", alpha=0.3) 66 | plt.plot(np.arange(1, n_iterations + 1), s_accs, color="C0") 67 | plt.title(label="Accuracy on test dataset") 68 | plt.xlabel("Round") 69 | plt.ylabel("Accuracy") 70 | plt.savefig("ssgd_acc_plot.png") 71 | 72 | 73 | if __name__ == "__main__": 74 | 75 | X, y = load_breast_cancer(return_X_y=True) 76 | 77 | D = X.shape[1] 78 | X_train, X_test, y_train, y_test = train_test_split( 79 | X, y, test_size=0.3, random_state=0, shuffle=True) 80 | n_train, n_test = X_train.shape[0], X_test.shape[0] 81 | 82 | spark = SparkSession\ 83 | .builder\ 84 | .appName("SSGD")\ 85 | .master("local[%d]" % n_threads)\ 86 | .getOrCreate() 87 | 88 | matrix = np.concatenate( 89 | [X_train, np.ones((n_train, 1)), y_train.reshape(-1, 1)], axis=1) 90 | 91 | 
points = spark.sparkContext.parallelize(matrix).cache() 92 | 93 | # Initialize w to a random value 94 | w = 2 * np.random.ranf(size=D + 1) - 1 95 | print("Initial w: " + str(w)) 96 | 97 | accs = [] 98 | for t in range(n_iterations): 99 | print("On iteration %d" % (t + 1)) 100 | w_br = spark.sparkContext.broadcast(w) 101 | 102 | (g, mini_batch_size) = points.sample(False, mini_batch_fraction, 42 + t)\ 103 | .map(lambda point: gradient(point, w_br.value))\ 104 | .treeAggregate( 105 | (0.0, 0),\ 106 | seqOp=lambda res, g: (res[0] + g, res[1] + 1),\ 107 | combOp=lambda res_1, res_2: (res_1[0] + res_2[0], res_1[1] + res_2[1]) 108 | ) 109 | 110 | w -= eta * (g/mini_batch_size + lam * reg_gradient(w, "l2")) 111 | 112 | y_pred = logistic_f(np.concatenate( 113 | [X_test, np.ones((n_test, 1))], axis=1), w) 114 | pred_label = np.where(y_pred < 0.5, 0, 1) 115 | acc = accuracy_score(y_test, pred_label) 116 | accs.append(acc) 117 | print("iterations: %d, accuracy: %f" % (t, acc)) 118 | 119 | print("Final w: %s " % w) 120 | print("Final acc: %f" % acc) 121 | 122 | spark.stop() 123 | 124 | draw_acc_plot(accs, n_iterations) 125 | 126 | 127 | # Final w: [ 3.58216967e+01 4.53599397e+01 2.07040135e+02 8.52414269e+01 128 | # 4.33038042e-01 -2.93986236e-01 1.43286366e-01 -2.95961229e-01 129 | # -7.63362321e-02 -3.93180625e-01 8.19325971e-01 3.30881477e+00 130 | # -3.25867503e+00 -1.24769634e+02 -8.52691792e-01 -5.18037887e-01 131 | # -1.34380402e-01 -7.49316038e-01 -8.76722455e-01 9.23748261e-01 132 | # 3.81531205e+01 5.56880612e+01 2.04895002e+02 -1.17586430e+02 133 | # 8.92355523e-01 -9.40611324e-01 -9.24082612e-01 -1.16210791e+00 134 | # 7.10117706e-01 -7.62921434e-02 4.48389687e+00] 135 | # Final acc: 0.929825 -------------------------------------------------------------------------------- /optimization/ssgd_pytorch.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | import torch 3 | import torch.multiprocessing as mp 4 | from torch.multiprocessing import Barrier 5 | from torchvision import datasets, transforms 6 | from torch.utils.data import Subset 7 | import os 8 | import torch 9 | import torch.optim as optim 10 | import torch.nn.functional as F 11 | import torch.nn as nn 12 | import torch.nn.functional as F 13 | 14 | 15 | batch_size = 64 # input batch size for training 16 | test_batch_size = 1000 # input batch size for testing 17 | epochs = 3 # number of global epochs to train 18 | lr = 0.01 # learning rate 19 | momentum = 0.5 # SGD momentum 20 | seed = 1 # random seed 21 | log_interval = 10 # how many batches to wait before logging training status 22 | n_workers = 4 # how many training processes to use 23 | cuda = True # enables CUDA training 24 | mps = False # enables macOS GPU training 25 | 26 | 27 | class CustomSubset(Subset): 28 | '''A custom subset class with customizable data transformation''' 29 | def __init__(self, dataset, indices, subset_transform=None): 30 | super().__init__(dataset, indices) 31 | self.subset_transform = subset_transform 32 | 33 | def __getitem__(self, idx): 34 | x, y = self.dataset[self.indices[idx]] 35 | if self.subset_transform: 36 | x = self.subset_transform(x) 37 | return x, y 38 | 39 | def __len__(self): 40 | return len(self.indices) 41 | 42 | 43 | def dataset_split(dataset, n_workers): 44 | n_samples = len(dataset) 45 | n_sample_per_workers = n_samples // n_workers 46 | local_datasets = [] 47 | for w_id in range(n_workers): 48 | if w_id < n_workers - 1: 49 | 
local_datasets.append(CustomSubset(dataset, range(w_id * n_sample_per_workers, (w_id + 1) * n_sample_per_workers))) 50 | else: 51 | local_datasets.append(CustomSubset(dataset, range(w_id * n_sample_per_workers, n_samples))) 52 | return local_datasets 53 | 54 | 55 | def pull_down(global_W, local_Ws, n_workers): 56 | # pull down global model to local 57 | for rank in range(n_workers): 58 | for name, value in local_Ws[rank].items(): 59 | local_Ws[rank][name].data = global_W[name].data 60 | 61 | 62 | def aggregate(global_W, local_Ws, n_workers): 63 | # init the global model 64 | for name, value in global_W.items(): 65 | global_W[name].data = torch.zeros_like(value) 66 | 67 | for rank in range(n_workers): 68 | for name, value in local_Ws[rank].items(): 69 | global_W[name].data += value.data 70 | 71 | for name in local_Ws[rank].keys(): 72 | global_W[name].data /= n_workers 73 | 74 | 75 | class Net(nn.Module): 76 | def __init__(self): 77 | super(Net, self).__init__() 78 | self.conv1 = nn.Conv2d(1, 10, kernel_size=5) 79 | self.conv2 = nn.Conv2d(10, 20, kernel_size=5) 80 | self.conv2_drop = nn.Dropout2d() 81 | self.fc1 = nn.Linear(320, 50) 82 | self.fc2 = nn.Linear(50, 10) 83 | 84 | def forward(self, x): 85 | x = F.relu(F.max_pool2d(self.conv1(x), 2)) 86 | x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2)) 87 | x = x.view(-1, 320) 88 | x = F.relu(self.fc1(x)) 89 | x = F.dropout(x, training=self.training) 90 | x = self.fc2(x) 91 | return F.log_softmax(x, dim=1) 92 | 93 | 94 | def train_epoch(epoch, rank, local_model, device, dataset, synchronizer, dataloader_kwargs): 95 | torch.manual_seed(seed + rank) 96 | train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs) 97 | optimizer = optim.SGD(local_model.parameters(), lr=lr, momentum=momentum) 98 | 99 | local_model.train() 100 | pid = os.getpid() 101 | for batch_idx, (data, target) in enumerate(train_loader): 102 | optimizer.zero_grad() 103 | output = local_model(data.to(device)) 104 | loss = F.nll_loss(output, target.to(device)) 105 | loss.backward() 106 | optimizer.step() 107 | if batch_idx % log_interval == 0: 108 | print('{}\tTrain Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format( 109 | pid, epoch + 1, batch_idx * len(data), len(train_loader.dataset), 110 | 100. * batch_idx / len(train_loader), loss.item())) 111 | 112 | synchronizer.wait() 113 | 114 | 115 | def test(epoch, model, device, dataset, dataloader_kwargs): 116 | torch.manual_seed(seed) 117 | test_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs) 118 | 119 | model.eval() 120 | test_loss = 0 121 | correct = 0 122 | with torch.no_grad(): 123 | for data, target in test_loader: 124 | output = model(data.to(device)) 125 | test_loss += F.nll_loss(output, target.to(device), reduction='sum').item() # sum up batch loss 126 | pred = output.max(1)[1] # get the index of the max log-probability 127 | correct += pred.eq(target.to(device)).sum().item() 128 | 129 | test_loss /= len(test_loader.dataset) 130 | print('\nTest Epoch: {} Global loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format( 131 | epoch + 1, test_loss, correct, len(test_loader.dataset), 132 | 100. 
* correct / len(test_loader.dataset))) 133 | 134 | 135 | if __name__ == "__main__": 136 | use_cuda = cuda and torch.cuda.is_available() 137 | use_mps = mps and torch.backends.mps.is_available() 138 | if use_cuda: 139 | device = torch.device("cuda") 140 | elif use_mps: 141 | device = torch.device("mps") 142 | else: 143 | device = torch.device("cpu") 144 | 145 | transform=transforms.Compose([ 146 | transforms.ToTensor(), 147 | transforms.Normalize((0.1307,), (0.3081,)) 148 | ]) 149 | train_dataset = datasets.MNIST('../data', train=True, download=True, 150 | transform=transform) 151 | test_dataset = datasets.MNIST('../data', train=False, download=True, 152 | transform=transform) 153 | local_train_datasets = dataset_split(train_dataset, n_workers) 154 | 155 | kwargs = {'batch_size': batch_size, 156 | 'shuffle': True} 157 | if use_cuda: 158 | kwargs.update({'num_workers': 1, # num_workers to load data 159 | 'pin_memory': True, 160 | }) 161 | 162 | torch.manual_seed(seed) 163 | mp.set_start_method('spawn', force=True) 164 | # Very important, otherwise CUDA memory cannot be allocated in the child process 165 | 166 | local_models = [Net().to(device) for i in range(n_workers)] 167 | global_model = Net().to(device) 168 | local_Ws = [{key: value for key, value in local_models[i].named_parameters()} for i in range(n_workers)] 169 | global_W = {key: value for key, value in global_model.named_parameters()} 170 | 171 | synchronizer = Barrier(n_workers) 172 | for epoch in range(epochs): 173 | for rank in range(n_workers): 174 | # pull down global model to local 175 | pull_down(global_W, local_Ws, n_workers) 176 | 177 | processes = [] 178 | for rank in range(n_workers): 179 | p = mp.Process(target=train_epoch, args=(epoch, rank, local_models[rank], device, 180 | local_train_datasets[rank], synchronizer, kwargs)) 181 | # We first train the model across `num_processes` processes 182 | p.start() 183 | processes.append(p) 184 | 185 | for p in processes: 186 | p.join() 187 | 188 | aggregate(global_W, local_Ws, n_workers) 189 | 190 | # We test the model each epoch 191 | test(epoch, global_model, device, test_dataset, kwargs) 192 | # Test result for synchronous training:Test Epoch: 3 Global loss: 0.0732, Accuracy: 9796/10000 (98%) 193 | # Test result for asynchronous training:Test Epoch: 3 Global loss: 0.0742, Accuracy: 9789/10000 (98%) 194 | -------------------------------------------------------------------------------- /pic/DistributedML-cover.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/orion-orion/Distributed-ML-PySpark/051790d6bc8d034cfa6af19e7d4f820f4c1fa6d6/pic/DistributedML-cover.jpeg -------------------------------------------------------------------------------- /randomized_algorithm/monte_carlo.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Descripttion: 3 | Version: 1.0 4 | Author: ZhangHongYu 5 | Date: 2022-07-01 21:28:32 6 | LastEditors: ZhangHongYu 7 | LastEditTime: 2022-07-01 21:48:31 8 | ''' 9 | from random import random 10 | from operator import add 11 | from pyspark.sql import SparkSession 12 | import sys 13 | import os 14 | 15 | os.environ['PYSPARK_PYTHON'] = sys.executable 16 | 17 | n_threads = 4 # Number of local threads 18 | # times of sampling 19 | n = 100000 * n_threads 20 | 21 | def is_accept(_: int) -> int: 22 | x = random() * 2 - 1 23 | y = random() * 2 - 1 24 | return 1 if x ** 2 + y ** 2 <= 1 else 0 25 | 26 | if __name__ == "__main__": 27 | spark = 
SparkSession\ 28 | .builder\ 29 | .appName("monte_carlo")\ 30 | .master("local[%d]" % n_threads)\ 31 | .getOrCreate() 32 | 33 | count = spark.sparkContext.parallelize(range(n)).map(is_accept).reduce(add) 34 | 35 | # The ratio of the area of the unit circle to the area of the enclosing 2x2 square is pi/4, so count/n ≈ pi/4. 36 | print("Pi is roughly %f" % (4.0 * count / n)) 37 | 38 | spark.stop() -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.22.3 2 | matplotlib==3.4.3 3 | scikit-learn==1.1.0 4 | torch==1.8.0 5 | torchvision==0.9.0 6 | pyspark==3.3.2 --------------------------------------------------------------------------------