├── .gitignore
├── LICENSE.txt
├── README.md
├── README.rst
├── benchmarking
│   ├── README.md
│   ├── benchmarking.png
│   ├── single_node_multiprocessing.py
│   └── single_node_serial_computing.py
├── minimalcluster
│   ├── __init__.py
│   ├── master_node.py
│   └── worker_node.py
├── setup.cfg
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | dist/*
3 | build/*
4 | *.egg-info/*
5 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Xiaodong DENG
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ***minimalcluster*** - A Minimal Cluster Computing Framework with Python
2 |
3 |
4 |
5 | "***minimal***" here means minimal dependency or platform requirements, as well as its nature of "minimum viable product". It's mainly for tackling straightforward "embarrassingly parallel" tasks using multiple commodity machines, also a good choice for *experimental and learning purposes*. The idea came from [Eli Bendersky's blog](https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing).
6 |
7 | ***minimalcluster*** is built using only plain Python and its standard libraries (mainly `multiprocessing`). This brings a few advantages, including
8 |
9 | - no additional installation or configuration is needed
10 | - 100% cross-platform (you can even have Linux, macOS, and Windows nodes within a single cluster).
11 |
12 | This package can be used with Python 2.7+ or 3.6+, but all nodes within a single cluster must run the same major version of Python (either 2 or 3).
13 |
14 | For more frameworks for parallel or cluster computing, you may also want to refer to [Parallel Processing and Multiprocessing in Python](https://wiki.python.org/moin/ParallelProcessing).
15 |
16 |
17 | #### Contents
18 |
19 | - [Benchmarking](#benchmarking)
20 |
21 | - [Usage & Examples](#usage--examples)
22 |
23 |
24 | ## Benchmarking
25 |
26 | 
27 | ![benchmarking](https://raw.githubusercontent.com/XD-DENG/minimalcluster-py/master/benchmarking/benchmarking.png)
28 | 
29 | 
30 | ([Details](https://github.com/XD-DENG/minimalcluster-py/blob/master/benchmarking/README.md))
31 |
32 |
33 | ## Usage & Examples
34 |
35 | #### Step 1 - Install this package
36 |
37 | ```
38 | pip install minimalcluster
39 | ```
40 |
41 | #### Step 2 - Start master node
42 |
43 | Open a Python session on the machine that will be used as the **Master Node**, and run
44 |
45 | ```python
46 | from minimalcluster import MasterNode
47 |
48 | your_host = '' # or use '0.0.0.0' if you have high enough privilege
49 | your_port=
50 | your_authkey = ''
51 |
52 | master = MasterNode(HOST = your_host, PORT = your_port, AUTHKEY = your_authkey)
53 | master.start_master_server()
54 | ```
55 |
56 | Please note that, by default, the master node also joins the cluster as a worker node. If you prefer otherwise, set the argument `if_join_as_worker` of `start_master_server()` to `False`. You can also remove the master node from the worker pool by invoking `master.stop_as_worker()`, and re-join by invoking `master.join_as_worker()`.
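For example, to start a dedicated master and toggle its worker role later (a minimal sketch using the methods above):

```python
master.start_master_server(if_join_as_worker = False)  # start as master only

master.join_as_worker()   # later: contribute this machine's cores as a worker too
master.stop_as_worker()   # and withdraw it from the worker pool again
```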
57 |
58 | #### Step 3 - Start worker nodes
59 |
60 | On all your **Worker Nodes**, run the commands below in a Python session
61 |
62 | ```python
63 | from minimalcluster import WorkerNode
64 |
65 | your_host = ''
66 | your_port=
67 | your_authkey = ''
68 | N_processors_to_use =
69 |
70 | worker = WorkerNode(your_host, your_port, your_authkey, nprocs = N_processors_to_use)
71 |
72 | worker.join_cluster()
73 |
74 | ```
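Note that `worker.join_cluster()` blocks: the worker keeps listening to the master node for job assignments until it is stopped or the connection is lost.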
75 |
76 | Note: if `nprocs` is bigger than the number of processors on your machine, the number of processors will be used instead.
77 |
78 | Once the worker nodes have joined, you can go back to your **Master Node** and check the list of connected **Worker Nodes**.
79 |
80 | ```python
81 | master.list_workers()
82 | ```
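
Each element of the returned list is a tuple of the worker's hostname, its number of available cores, the PID of its heartbeat process, and a flag indicating whether it is currently working on a job (see the `list_workers()` docstring in `master_node.py`).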
83 |
84 |
85 | #### Step 4 - Prepare environment to share with worker nodes
86 |
87 | We need to specify the task function (as well as its dependencies) and the arguments to share with the worker nodes:
88 | 
89 | - **Environment**: The `environment` is simply the code that is going to run on the worker nodes. There are two ways to set it up. The first is to prepare a separate `.py` file declaring all the functions you need, then load it with `master.load_envir('')`. The second, for simple cases, is to pass the statements directly, e.g. `master.load_envir("f = lambda x: x * 2", from_file = False)`.
90 | 
91 | - **Task Function**: Register the task function using `master.register_target_function('')`, like `master.register_target_function("f")`. Please note the task function itself must be declared in the environment file or statement.
92 | 
93 | - **Arguments**: The arguments must be given as a list; each element will be passed to the task function as one task. Usage: `master.load_args(args)`. **Note the elements in list `args` must be unique** (results are returned as a dict keyed by argument). A combined sketch follows this list.
94 |
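Putting the three pieces together, here is a minimal sketch of the file-based approach (`envir.py` is a hypothetical environment file):

```python
# envir.py (hypothetical) would contain, for example:
#     f = lambda x: x * 2

master.load_envir("envir.py")           # from_file defaults to True
master.register_target_function("f")    # "f" must be declared in envir.py
master.load_args([1, 2, 3, 4])          # a list of unique arguments, one per task
```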
95 |
96 | #### Step 5 - Submit jobs
97 |
98 | Now your cluster is ready. You can try the examples below in the Python session on your **Master Node**.
99 |
100 | ##### Example 1 - Estimate value of Pi
101 |
102 | ```python
103 | envir_statement = '''
104 | from random import random
105 | example_pi_estimate_throw = lambda x: 1 if (random() * 2 - 1)**2 + (random() * 2 - 1)**2 < 1 else 0
106 | '''
107 | master.load_envir(envir_statement, from_file = False)
108 | master.register_target_function("example_pi_estimate_throw")
109 |
110 | N = int(1e6)
111 | master.load_args(range(N))
112 |
113 | result = master.execute()
114 |
115 | print("Pi is roughly %f" % (4.0 * sum([x2 for x1, x2 in result.items()]) / N))
116 | ```
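
Each element of `range(N)` triggers one random throw at the unit square; the fraction of throws landing inside the unit circle approximates pi/4, hence the factor of 4 in the last line.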
117 |
118 | ##### Example 2 - Factorization
119 |
120 | ```python
121 | envir_statement = '''
122 | # A naive factorization method. Take integer 'n', return list of factors.
123 | # Ref: https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing
124 | def example_factorize_naive(n):
125 | if n < 2:
126 | return []
127 | factors = []
128 | p = 2
129 | while True:
130 | if n == 1:
131 | return factors
132 | r = n % p
133 | if r == 0:
134 | factors.append(p)
135 |             n = n // p  # floor division keeps n an integer on Python 3
136 | elif p * p >= n:
137 | factors.append(n)
138 | return factors
139 | elif p > 2:
140 | p += 2
141 | else:
142 | p += 1
143 | assert False, "unreachable"
144 | '''
145 |
146 | #Create N large numbers to factorize.
147 | def make_nums(N):
148 | nums = [999999999999]
149 | for i in range(N):
150 | nums.append(nums[-1] + 2)
151 | return nums
152 |
153 | master.load_args(make_nums(5000))
154 | master.load_envir(envir_statement, from_file = False)
155 | master.register_target_function("example_factorize_naive")
156 |
157 | result = master.execute()
158 |
159 | for x in list(result.items())[:10]:  # list() makes this work on both Python 2 and 3
160 | print(x)
161 | ```
162 |
163 | ##### Example 3 - Feed multiple arguments to target function
164 |
165 | It's possible that you need to feed multiple arguments to the target function. A small trick is needed here: wrap your arguments into a tuple, then pass the tuple to the target function as a "single" argument. Within the target function, you can unpack the tuple to obtain the individual arguments.
166 |
167 | ```python
168 | envir_statement = '''
169 | f = lambda x:x[0]+x[1]
170 | '''
171 | master.load_envir(envir_statement, from_file = False)
172 | master.register_target_function("f")
173 |
174 | master.load_args([(1,2), (3,4), (5, 6), (7, 8)])
175 |
176 | result = master.execute()
177 |
178 | print(result)
179 | ```
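
`result` is a dict mapping each argument tuple to the value returned by `f`, e.g. `{(1, 2): 3, (3, 4): 7, (5, 6): 11, (7, 8): 15}` (key order may vary).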
180 |
181 | #### Step 6 - Shut down the cluster
182 |
183 | You can shut down the cluster by running
184 |
185 | ```python
186 | master.shutdown()
187 | ```
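
As implemented in `MasterNode.shutdown()`, this first stops the master's own worker role (if it has one) and then shuts down the manager server.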
188 |
189 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | ====================================================================
2 | *minimalcluster* - A Minimal Cluster Computing Framework with Python
3 | ====================================================================
4 |
5 | "**minimal**" here means minimal dependency or platform requirements, as well as its nature of "minimum viable product". It's mainly for tackling straightforward "embarrassingly parallel" tasks using multiple commodity machines, also a good choice for experimental and learning purposes. The idea came from `Eli Bendersky's blog `_
6 | .
7 |
8 | **minimalcluster** is built using only plain Python and its standard libraries (mainly *multiprocessing*). This brings a few advantages, including
9 |
10 | - no additional installation or configuration is needed
11 |
12 | - 100% cross-platform (you can even have Linux, macOS, and Windows nodes within a single cluster).
13 |
14 | This package can be used with Python 2.7+ or 3.6+, but all nodes within a single cluster must run the same major version of Python (either 2 or 3).
15 |
16 | For more frameworks for parallel or cluster computing, you may also want to refer to `Parallel Processing and Multiprocessing in Python <https://wiki.python.org/moin/ParallelProcessing>`_.
17 | 
18 |
19 |
20 | ******************
21 | Benchmarking
22 | ******************
23 |
24 | .. image:: https://raw.githubusercontent.com/XD-DENG/minimalcluster-py/master/benchmarking/benchmarking.png
25 |
26 |
27 | `Details <https://github.com/XD-DENG/minimalcluster-py/blob/master/benchmarking/README.md>`_
28 |
29 | ******************
30 | Usage & Examples
31 | ******************
32 |
33 | Step 1 - Install this package
34 | =============================
35 |
36 | .. code-block:: bash
37 |
38 | pip install minimalcluster
39 |
40 | Step 2 - Start master node
41 | =============================
42 | Open a Python session on the machine that will be used as the **Master Node**, and run
43 |
44 | .. code:: python
45 |
46 | from minimalcluster import MasterNode
47 |
48 | your_host = ''
49 | # or use '0.0.0.0' if you have high enough privilege
50 | your_port=
51 | your_authkey = ''
52 |
53 | master = MasterNode(HOST = your_host, PORT = your_port, AUTHKEY = your_authkey)
54 | master.start_master_server()
55 |
56 |
57 | Please note that, by default, the master node also joins the cluster as a worker node. If you prefer otherwise, set the argument *if_join_as_worker* of *start_master_server()* to *False*. You can also remove the master node from the worker pool by invoking *master.stop_as_worker()*, and re-join by invoking *master.join_as_worker()*.
58 |
59 |
60 | Step 3 - Start worker nodes
61 | =============================
62 |
63 | On all your **Worker Nodes**, run the commands below in a Python session
64 |
65 | .. code:: python
66 |
67 | from minimalcluster import WorkerNode
68 |
69 | your_host = ''
70 | your_port=
71 | your_authkey = ''
72 | N_processors_to_use =
73 |
74 | worker = WorkerNode(your_host, your_port, your_authkey, nprocs = N_processors_to_use)
75 |
76 | worker.join_cluster()
77 |
78 | Note: if nprocs is bigger than the number of processors on your machine, the number of processors will be used instead.
79 |
80 | Once the worker nodes have joined, you can go back to your Master node and check the list of connected Worker nodes.
81 |
82 | .. code:: python
83 |
84 | master.list_workers()
85 |
86 |
87 | Step 4 - Prepare environment to share with worker nodes
88 | =======================================================
89 |
90 | We need to specify the task function (as well as its dependencies) and the arguments to share with the worker nodes:
91 | 
92 | **Environment**: The environment is simply the code that is going to run on the worker nodes. There are two ways to set it up. The first is to prepare a separate .py file declaring all the functions you need, then load it with *master.load_envir('')*. The second, for simple cases, is to pass the statements directly, e.g. *master.load_envir("f = lambda x: x * 2", from_file = False)*.
93 | 
94 | **Task Function**: Register the task function using *master.register_target_function('')*, like *master.register_target_function("f")*. Please note the task function itself must be declared in the environment file or statement.
95 | 
96 | **Arguments**: The arguments must be given as a list; each element will be passed to the task function as one task. Usage: *master.load_args(args)*. **Note the elements in list args must be unique** (results are returned as a dict keyed by argument).
97 |
98 | Step 5 - Submit jobs
99 | ====================
100 |
101 | Now your cluster is ready. You can try the examples below in the Python session on your Master node.
102 |
103 | Example 1 - Estimate value of Pi
104 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
105 |
106 | .. code:: python
107 |
108 | envir_statement = '''
109 | from random import random
110 | example_pi_estimate_throw = lambda x: 1 if (random() * 2 - 1)**2 + (random() * 2 - 1)**2 < 1 else 0
111 | '''
112 | master.load_envir(envir_statement, from_file = False)
113 | master.register_target_function("example_pi_estimate_throw")
114 |
115 | N = int(1e6)
116 | master.load_args(range(N))
117 |
118 | result = master.execute()
119 |
120 | print("Pi is roughly %f" % (4.0 * sum([x2 for x1, x2 in result.items()]) / N))
121 |
122 |
123 | Example 2 - Factorization
124 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
125 |
126 | .. code:: python
127 |
128 | envir_statement = '''
129 | # A naive factorization method. Take integer 'n', return list of factors.
130 | # Ref: https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing
131 | def example_factorize_naive(n):
132 | if n < 2:
133 | return []
134 | factors = []
135 | p = 2
136 | while True:
137 | if n == 1:
138 | return factors
139 | r = n % p
140 | if r == 0:
141 | factors.append(p)
142 |                 n = n // p  # floor division keeps n an integer on Python 3
143 | elif p * p >= n:
144 | factors.append(n)
145 | return factors
146 | elif p > 2:
147 | p += 2
148 | else:
149 | p += 1
150 | assert False, "unreachable"
151 | '''
152 |
153 | #Create N large numbers to factorize.
154 | def make_nums(N):
155 | nums = [999999999999]
156 | for i in range(N):
157 | nums.append(nums[-1] + 2)
158 | return nums
159 |
160 | master.load_args(make_nums(5000))
161 | master.load_envir(envir_statement, from_file = False)
162 | master.register_target_function("example_factorize_naive")
163 |
164 | result = master.execute()
165 |
166 |     for x in list(result.items())[:10]:  # list() makes this work on both Python 2 and 3
167 | print(x)
168 |
169 | Example 3 - Feed multiple arguments to target function
170 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
171 |
172 | It's possible that you need to feed multiple arguments to the target function. A small trick is needed here: wrap your arguments into a tuple, then pass the tuple to the target function as a "single" argument. Within the target function, you can unpack the tuple to obtain the individual arguments.
173 |
174 | .. code:: python
175 |
176 | envir_statement = '''
177 | f = lambda x:x[0]+x[1]
178 | '''
179 | master.load_envir(envir_statement, from_file = False)
180 | master.register_target_function("f")
181 |
182 | master.load_args([(1,2), (3,4), (5, 6), (7, 8)])
183 |
184 | result = master.execute()
185 |
186 | print(result)
187 |
188 | Step 6 - Shut down the cluster
189 | ==============================
190 |
191 | You can shut down the cluster by running
192 |
193 | .. code:: python
194 |
195 | master.shutdown()
196 |
197 |
198 |
199 |
--------------------------------------------------------------------------------
/benchmarking/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## Sample Problem to Tackle
3 |
4 | We try to factorize some given big integers, using a naive factorization method.
5 |
6 | ```python
7 | def example_factorize_naive(n):
8 | if n < 2:
9 | return []
10 | factors = []
11 | p = 2
12 | while True:
13 | if n == 1:
14 | return factors
15 | r = n % p
16 | if r == 0:
17 | factors.append(p)
18 |             n = n // p  # floor division keeps n an integer on Python 3
19 | elif p * p >= n:
20 | factors.append(n)
21 | return factors
22 | elif p > 2:
23 | p += 2
24 | else:
25 | p += 1
26 | assert False, "unreachable"
27 | ```
28 |
29 | The big integers are generated using the code below
30 |
31 | ```python
32 | def make_nums(N):
33 | nums = [999999999999]
34 | for i in range(N):
35 | nums.append(nums[-1] + 2)
36 | return nums
37 |
38 | N = 20000
39 | big_ints = make_nums(N)
40 | ```
41 |
42 |
43 | ## Different Methods and Results
44 |
45 | | Method | Time |
46 | | ------ | ---- |
47 | | Serial computing with a single processor (`single_node_serial_computing.py`) | ~180 seconds |
48 | | Parallel computing with multiple processors on a single machine (`single_node_multiprocessing.py`) | ~47 seconds |
49 | | minimalcluster, single node (4 processors X 1) | ~47 seconds |
50 | | minimalcluster, two nodes (4 processors X 2) | ~26 seconds |
51 | | minimalcluster, three nodes (4 processors X 3) | ~19 seconds |
52 | 
53 | 
54 |
55 |
56 |
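For reference, the minimalcluster timings above can be reproduced with a driver along these lines, run on the master node (a sketch: it assumes worker nodes have already joined as described in the main README, that `envir_statement` holds the source of `example_factorize_naive`, and that `make_nums` is defined as above; host/port/authkey are placeholders):

```python
import time
from minimalcluster import MasterNode

master = MasterNode(HOST = '0.0.0.0', PORT = 8888, AUTHKEY = 'change_me')
master.start_master_server()

master.load_envir(envir_statement, from_file = False)  # environment defining example_factorize_naive
master.register_target_function("example_factorize_naive")
master.load_args(make_nums(20000))                     # the 20000 big integers above

t_start = time.time()
result = master.execute()
t_end = time.time()

print("[minimalcluster] Lapse: {} seconds".format(t_end - t_start))
```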
57 | ## Specs
58 |
59 | The benchmarking was done on three virtual machines on DigitalOcean.
60 |
61 | ### Network
62 |
63 | Using the normal network connection among virtual machines. The ping time is about 0.35 to 0.40 ms.
64 |
65 | ### CPU Info
66 |
67 | ```
68 | Architecture: x86_64
69 | CPU(s): 4
70 | Vendor ID: GenuineIntel
71 | CPU family: 6
72 | Model: 85
73 | Model name: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
74 | CPU MHz: 2693.658
75 | ```
76 |
77 | ### Software
78 |
79 | - Python 2.7.5
80 | - minimalcluster 0.1.0.dev5
--------------------------------------------------------------------------------
/benchmarking/benchmarking.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/XD-DENG/minimalcluster-py/4c57908d93a5d5f8a3b0e72fa6275c49871183d6/benchmarking/benchmarking.png
--------------------------------------------------------------------------------
/benchmarking/single_node_multiprocessing.py:
--------------------------------------------------------------------------------
1 | import time
2 | from multiprocessing import Pool
3 |
4 | # A naive factorization method. Take integer 'n', return list of factors.
5 | # Ref: https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing
6 | def example_factorize_naive(n):
7 | if n < 2:
8 | return []
9 | factors = []
10 | p = 2
11 | while True:
12 | if n == 1:
13 | return factors
14 | r = n % p
15 | if r == 0:
16 | factors.append(p)
17 |             n = n // p  # floor division keeps n an integer on Python 3
18 | elif p * p >= n:
19 | factors.append(n)
20 | return factors
21 | elif p > 2:
22 | p += 2
23 | else:
24 | p += 1
25 | assert False, "unreachable"
26 |
27 |
28 | def make_nums(N):
29 | nums = [999999999999]
30 | for i in range(N):
31 | nums.append(nums[-1] + 2)
32 | return nums
33 |
34 | N = 20000
35 | big_ints = make_nums(N)
36 |
37 | t_start = time.time()
38 | p = Pool(4)
39 | result = p.map(example_factorize_naive, big_ints)
40 | t_end = time.time()
41 |
42 | print("[Single-Node Parallel Computing (4 Cores)] Lapse: {} seconds".format(t_end - t_start))
--------------------------------------------------------------------------------
/benchmarking/single_node_serial_computing.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | # A naive factorization method. Take integer 'n', return list of factors.
4 | # Ref: https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing
5 | def example_factorize_naive(n):
6 | if n < 2:
7 | return []
8 | factors = []
9 | p = 2
10 | while True:
11 | if n == 1:
12 | return factors
13 | r = n % p
14 | if r == 0:
15 | factors.append(p)
16 |             n = n // p  # floor division keeps n an integer on Python 3
17 | elif p * p >= n:
18 | factors.append(n)
19 | return factors
20 | elif p > 2:
21 | p += 2
22 | else:
23 | p += 1
24 | assert False, "unreachable"
25 |
26 |
27 | def make_nums(N):
28 | nums = [999999999999]
29 | for i in range(N):
30 | nums.append(nums[-1] + 2)
31 | return nums
32 |
33 | N = 20000
34 | big_ints = make_nums(N)
35 |
36 | t_start = time.time()
37 | result = list(map(example_factorize_naive, big_ints))  # list() forces evaluation on Python 3, where map() is lazy
38 | t_end = time.time()
39 |
40 | print("[Single-Node Serial Computing (Single Core)] Lapse: {} seconds".format(t_end - t_start))
--------------------------------------------------------------------------------
/minimalcluster/__init__.py:
--------------------------------------------------------------------------------
1 |
2 | __all__ = ['MasterNode', "WorkerNode"]
3 |
4 |
5 | import sys
6 | if sys.version_info.major == 3:
7 | from .master_node import MasterNode
8 | from .worker_node import WorkerNode
9 | else:
10 | from master_node import MasterNode
11 | from worker_node import WorkerNode
12 |
--------------------------------------------------------------------------------
/minimalcluster/master_node.py:
--------------------------------------------------------------------------------
1 | from multiprocessing.managers import SyncManager, DictProxy
2 | from multiprocessing import Process, cpu_count
3 | import os, signal, sys, time, datetime, random, string, inspect
4 | from functools import partial
5 | from types import FunctionType
6 | from socket import getfqdn
7 | if sys.version_info.major == 3:
8 | from queue import Queue as _Queue
9 | else:
10 | from Queue import Queue as _Queue
11 |
12 |
13 |
14 | __all__ = ['MasterNode']
15 |
16 |
17 |
18 | # Make Queue.Queue pickleable
19 | # Ref: https://stackoverflow.com/questions/25631266/cant-pickle-class-main-jobqueuemanager
20 | class Queue(_Queue):
21 | """ A picklable queue. """
22 | def __getstate__(self):
23 | # Only pickle the state we care about
24 | return (self.maxsize, self.queue, self.unfinished_tasks)
25 |
26 | def __setstate__(self, state):
27 | # Re-initialize the object, then overwrite the default state with
28 | # our pickled state.
29 | Queue.__init__(self)
30 | self.maxsize = state[0]
31 | self.queue = state[1]
32 | self.unfinished_tasks = state[2]
33 |
34 | # prepare for using functools.partial()
35 | def get_fun(fun):
36 | return fun
37 |
38 |
39 |
40 | class JobQueueManager(SyncManager):
41 | pass
42 |
43 |
44 | def clear_queue(q):
45 | while not q.empty():
46 | q.get()
47 |
48 |
49 | def start_worker_in_background(HOST, PORT, AUTHKEY, nprocs, quiet):
50 | from minimalcluster import WorkerNode
51 | worker = WorkerNode(HOST, PORT, AUTHKEY, nprocs, quiet)
52 | worker.join_cluster()
53 |
54 |
55 | class MasterNode():
56 |
57 | def __init__(self, HOST = '127.0.0.1', PORT = 8888, AUTHKEY = None, chunksize = 50):
58 | '''
59 | Method to initiate a master node object.
60 |
61 | HOST: the hostname or IP address to use
62 | PORT: the port to use
63 | AUTHKEY: The process's authentication key (a string or byte string).
64 | If None is given, a random string will be given
65 |         chunksize: The arguments are split into chunks, and each chunk is pushed into the job queue.
66 |                    Here the size of each chunk is specified.
67 | '''
68 |
69 | # Check & process AUTHKEY
70 |         # to [1] ensure compatibility between Py 2 and 3; [2] allow both string and byte string for AUTHKEY input.
71 |         assert type(AUTHKEY) in [str, bytes] or AUTHKEY is None, "AUTHKEY must be either a string, a byte string, or None (a random AUTHKEY will be generated if None is given)."
72 | if AUTHKEY != None and type(AUTHKEY) == str:
73 | AUTHKEY = AUTHKEY.encode()
74 |
75 | self.HOST = HOST
76 | self.PORT = PORT
77 | self.AUTHKEY = AUTHKEY if AUTHKEY != None else ''.join(random.choice(string.ascii_uppercase) for _ in range(6)).encode()
78 | self.chunksize = chunksize
79 | self.server_status = 'off'
80 | self.as_worker = False
81 | self.target_fun = None
82 | self.master_fqdn = getfqdn()
83 | self.pid_as_worker_on_master = None
84 |
85 |
86 | def join_as_worker(self):
87 | '''
88 | This method helps start the master node as a worker node as well
89 | '''
90 | if self.as_worker:
91 | print("[WARNING] This node has already joined the cluster as a worker node.")
92 | else:
93 | self.process_as_worker = Process(target = start_worker_in_background, args=(self.HOST, self.PORT, self.AUTHKEY, cpu_count(), True, ))
94 | self.process_as_worker.start()
95 |
96 | # waiting for the master node joining the cluster as a worker
97 | while self.master_fqdn not in [w[0] for w in self.list_workers()]:
98 | pass
99 |
100 | self.pid_as_worker_on_master = [w for w in self.list_workers() if w[0] == self.master_fqdn][0][2]
101 | self.as_worker = True
102 | print("[INFO] Current node has joined the cluster as a Worker Node (using {} processors; Process ID: {}).".format(cpu_count(), self.process_as_worker.pid))
103 |
104 | def start_master_server(self, if_join_as_worker = True):
105 | """
106 | Method to create a manager as the master node.
107 |
108 | Arguments:
109 |             if_join_as_worker: Boolean.
110 | If True, the master node will also join the cluster as worker node. It will automatically run in background.
111 | If False, users need to explicitly configure if they want the master node to work as worker node too.
112 | The default value is True.
113 | """
114 | self.job_q = Queue()
115 | self.result_q = Queue()
116 | self.error_q = Queue()
117 | self.get_envir = Queue()
118 | self.target_function = Queue()
119 | self.raw_queue_of_worker_list = Queue()
120 |         self.raw_dict_of_job_history = dict() # a dict to store the history of job assignments
121 |
122 | # Return synchronized proxies for the actual Queue objects.
123 | # Note that for "callable=", we don't use `lambda` which is commonly used in multiprocessing examples.
124 |         # Instead, we use `partial()` to wrap it one more time.
125 |         # This is to avoid "pickle.PicklingError" on the Windows platform, and helps the code run on both Windows and Linux/macOS.
126 | # Ref: https://stackoverflow.com/questions/25631266/cant-pickle-class-main-jobqueuemanager
127 |
128 | JobQueueManager.register('get_job_q', callable=partial(get_fun, self.job_q))
129 | JobQueueManager.register('get_result_q', callable=partial(get_fun, self.result_q))
130 | JobQueueManager.register('get_error_q', callable=partial(get_fun, self.error_q))
131 | JobQueueManager.register('get_envir', callable = partial(get_fun, self.get_envir))
132 | JobQueueManager.register('target_function', callable = partial(get_fun, self.target_function))
133 | JobQueueManager.register('queue_of_worker_list', callable = partial(get_fun, self.raw_queue_of_worker_list))
134 | JobQueueManager.register('dict_of_job_history', callable = partial(get_fun, self.raw_dict_of_job_history), proxytype=DictProxy)
135 |
136 | self.manager = JobQueueManager(address=(self.HOST, self.PORT), authkey=self.AUTHKEY)
137 | self.manager.start()
138 | self.server_status = 'on'
139 | print('[{}] Master Node started at {}:{} with authkey `{}`.'.format(str(datetime.datetime.now()), self.HOST, self.PORT, self.AUTHKEY.decode()))
140 |
141 | self.shared_job_q = self.manager.get_job_q()
142 | self.shared_result_q = self.manager.get_result_q()
143 | self.shared_error_q = self.manager.get_error_q()
144 | self.share_envir = self.manager.get_envir()
145 | self.share_target_fun = self.manager.target_function()
146 | self.queue_of_worker_list = self.manager.queue_of_worker_list()
147 | self.dict_of_job_history = self.manager.dict_of_job_history()
148 |
149 | if if_join_as_worker:
150 | self.join_as_worker()
151 |
152 | def stop_as_worker(self):
153 | '''
154 |         Since the master node can also join the cluster as a worker, we need a method to stop it working as a worker node (which may be necessary in some cases).
155 |         This method serves that purpose.
156 | 
157 |         Because the worker node starts a separate process for heartbeat purposes,
158 |         we need to shut down the heartbeat process separately.
159 | '''
160 | try:
161 | os.kill(self.pid_as_worker_on_master, signal.SIGTERM)
162 | self.pid_as_worker_on_master = None
163 | self.process_as_worker.terminate()
164 | except AttributeError:
165 | print("[WARNING] The master node has not started as a worker yet.")
166 | finally:
167 | self.as_worker = False
168 | print("[INFO] The master node has stopped working as a worker node.")
169 |
170 | def list_workers(self):
171 | '''
172 | Return a list of connected worker nodes.
173 | Each element of this list is:
174 | (hostname of worker node,
175 | # of available cores,
176 | pid of heartbeat process on the worker node,
177 | if the worker node is working on any work load currently (1:Yes, 0:No))
178 | '''
179 |
180 | # STEP-1: an element will be PUT into the queue "self.queue_of_worker_list"
181 | # STEP-2: worker nodes will watch on this queue and attach their information into this queue too
182 |         # STEP-3: this function will collect the elements from the queue and return the list of worker nodes that responded
183 |
184 | self.queue_of_worker_list.put(".") # trigger worker nodes to contact master node to show their "heartbeat"
185 | time.sleep(0.3) # Allow some time for collecting "heartbeat"
186 |
187 | worker_list = []
188 | while not self.queue_of_worker_list.empty():
189 | worker_list.append(self.queue_of_worker_list.get())
190 |
191 | return list(set([w for w in worker_list if w != "."]))
192 |
193 |
194 | def load_envir(self, source, from_file = True):
195 | if from_file:
196 | with open(source, 'r') as f:
197 | self.envir_statements = "".join(f.readlines())
198 | else:
199 | self.envir_statements = source
200 |
201 | def register_target_function(self, fun_name):
202 | self.target_fun = fun_name
203 |
204 | def load_args(self, args):
205 | '''
206 | args should be a list
207 | '''
208 | self.args_to_share_to_workers = args
209 |
210 |
211 | def __check_target_function(self):
212 |
213 | try:
214 | exec(self.envir_statements)
215 | except:
216 | print("[ERROR] The environment statements given can't be executed.")
217 | raise
218 |
219 | if self.target_fun in locals() and isinstance(locals()[self.target_fun], FunctionType):
220 | return True
221 | else:
222 | return False
223 |
224 |
225 | def execute(self):
226 |
227 | # Ensure the error queue is empty
228 | clear_queue(self.shared_error_q)
229 |
230 | if self.target_fun == None:
231 | print("[ERROR] Target function is not registered yet.")
232 | elif not self.__check_target_function():
233 | print("[ERROR] The target function registered (`{}`) can't be built with the given environment statements.".format(self.target_fun))
234 | elif len(self.args_to_share_to_workers) != len(set(self.args_to_share_to_workers)):
235 |             print("[ERROR] The arguments to share with worker nodes are not unique. Please check the data you passed to MasterNode.load_args().")
236 |         elif len(self.list_workers()) == 0:
237 |             print("[ERROR] No worker node is available. Can't proceed to execute.")
238 | else:
239 | print("[{}] Assigning jobs to worker nodes.".format(str(datetime.datetime.now())))
240 |
241 |
242 | self.share_envir.put(self.envir_statements)
243 |
244 | self.share_target_fun.put(self.target_fun)
245 |
246 | # The numbers are split into chunks. Each chunk is pushed into the job queue
247 | for i in range(0, len(self.args_to_share_to_workers), self.chunksize):
248 | self.shared_job_q.put((i, self.args_to_share_to_workers[i:(i + self.chunksize)]))
249 |
250 | # Wait until all results are ready in shared_result_q
251 | numresults = 0
252 | resultdict = {}
253 | list_job_id_done = []
254 | while numresults < len(self.args_to_share_to_workers):
255 |
256 | if len(self.list_workers()) == 0:
257 |                 print("[{}][Warning] No valid worker node at this moment. You can wait for workers to join, or press CTRL+C to cancel.".format(str(datetime.datetime.now())))
258 | continue
259 |
260 | if self.shared_job_q.empty() and sum([w[3] for w in self.list_workers()]) == 0:
261 |                 '''
262 |                 After all jobs are assigned and all worker nodes have finished their work,
263 |                 check whether the nodes that have unfinished jobs are still alive.
264 |                 If not, re-collect these jobs and put them back into the job queue.
265 |                 '''
266 | while not self.shared_result_q.empty():
267 | try:
268 | job_id_done, outdict = self.shared_result_q.get(False)
269 | resultdict.update(outdict)
270 | list_job_id_done.append(job_id_done)
271 | numresults += len(outdict)
272 | except:
273 | pass
274 |
275 | [self.dict_of_job_history.pop(k, None) for k in list_job_id_done]
276 |
277 | for job_id in [x for x,y in self.dict_of_job_history.items()]:
278 | print("Putting {} back to the job queue".format(job_id))
279 | self.shared_job_q.put((job_id, self.args_to_share_to_workers[job_id:(job_id + self.chunksize)]))
280 |
281 | if not self.shared_error_q.empty():
282 |                         print("[ERROR] Running error occurred in remote worker node:")
283 | print(self.shared_error_q.get())
284 |
285 | clear_queue(self.shared_job_q)
286 | clear_queue(self.shared_result_q)
287 | clear_queue(self.share_envir)
288 | clear_queue(self.share_target_fun)
289 | clear_queue(self.shared_error_q)
290 | self.dict_of_job_history.clear()
291 |
292 | return None
293 |
294 | # job_id_done is the unique id of the jobs that have been done and returned to the master node.
295 | while not self.shared_result_q.empty():
296 | try:
297 | job_id_done, outdict = self.shared_result_q.get(False)
298 | resultdict.update(outdict)
299 | list_job_id_done.append(job_id_done)
300 | numresults += len(outdict)
301 | except:
302 | pass
303 |
304 |
305 | print("[{}] Aggregating on Master node...".format(str(datetime.datetime.now())))
306 |
307 | # After the execution is done, empty all the args & task function queues
308 | # to prepare for the next execution
309 | clear_queue(self.shared_job_q)
310 | clear_queue(self.shared_result_q)
311 | clear_queue(self.share_envir)
312 | clear_queue(self.share_target_fun)
313 | self.dict_of_job_history.clear()
314 |
315 | return resultdict
316 |
317 |
318 | def shutdown(self):
319 | if self.as_worker:
320 | self.stop_as_worker()
321 |
322 | if self.server_status == 'on':
323 | self.manager.shutdown()
324 | self.server_status = "off"
325 | print("[INFO] The master node is shut down.")
326 | else:
327 | print("[WARNING] The master node is not started yet or already shut down.")
328 |
--------------------------------------------------------------------------------
/minimalcluster/worker_node.py:
--------------------------------------------------------------------------------
1 | from multiprocessing.managers import SyncManager
2 | import multiprocessing
3 | import sys, os, time, datetime
4 | from socket import getfqdn
5 | if sys.version_info.major == 3:
6 | import queue as Queue
7 | else:
8 | import Queue
9 |
10 | __all__ = ['WorkerNode']
11 |
12 | def single_worker(envir, fun, job_q, result_q, error_q, history_d, hostname):
13 |     """ A worker function to be launched in a separate process. Takes jobs from
14 |     job_q - each job is a chunk of arguments for the target function. When a job
15 |     is done, the result (a dict mapping each argument to its result) is placed
16 |     into result_q. Runs until job_q is empty.
17 |     """
18 |
19 | # Reference:
20 | #https://stackoverflow.com/questions/4484872/why-doesnt-exec-work-in-a-function-with-a-subfunction
21 | exec(envir) in locals()
22 | globals().update(locals())
23 | while True:
24 | try:
25 | job_id, job_detail = job_q.get_nowait()
26 | # history_q.put({job_id: hostname})
27 | history_d[job_id] = hostname
28 | outdict = {n: globals()[fun](n) for n in job_detail}
29 | result_q.put((job_id, outdict))
30 | except Queue.Empty:
31 | return
32 | except:
33 | # send the Unexpected error to master node
34 | error_q.put("Worker Node '{}': ".format(hostname) + "; ".join([repr(e) for e in sys.exc_info()]))
35 | return
36 |
37 | def mp_apply(envir, fun, shared_job_q, shared_result_q, shared_error_q, shared_history_d, hostname, nprocs):
38 | """ Split the work with jobs in shared_job_q and results in
39 | shared_result_q into several processes. Launch each process with
40 | single_worker as the worker function, and wait until all are
41 | finished.
42 | """
43 |
44 | procs = []
45 | for i in range(nprocs):
46 | p = multiprocessing.Process(
47 | target=single_worker,
48 | args=(envir, fun, shared_job_q, shared_result_q, shared_error_q, shared_history_d, hostname))
49 | procs.append(p)
50 | p.start()
51 |
52 | for p in procs:
53 | p.join()
54 |
55 | # this function is put at top level rather than as a method of WorkerNode class
56 | # this is to bypass the error "AttributeError: type object 'ServerQueueManager' has no attribute 'from_address'""
57 | def heartbeat(queue_of_worker_list, worker_hostname, nprocs, status):
58 | '''
59 |     heartbeat keeps an eye on whether the master node is checking the list of valid nodes.
60 |     If it detects the signal, it shares the information of the current node with the master node.
61 | '''
62 | while True:
63 | if not queue_of_worker_list.empty():
64 | queue_of_worker_list.put((worker_hostname, nprocs, os.getpid(), status.value))
65 | time.sleep(0.01)
66 |
67 | class WorkerNode():
68 |
69 | def __init__(self, IP, PORT, AUTHKEY, nprocs, quiet = False):
70 | '''
71 |         Method to initiate a worker node object.
72 | 
73 |         IP: the hostname or IP address of the Master Node
74 |         PORT: the port to use (decided by the Master Node)
75 |         AUTHKEY: The process's authentication key (a string or byte string).
76 |                  It can't be None for Worker Nodes.
77 |         nprocs: Integer. The number of processors on the Worker Node to make available to the Master Node.
78 |                 It should be less than or equal to the number of processors on the Worker Node; if it is higher, the number of available processors will be used instead.
79 | '''
80 |
81 | assert type(AUTHKEY) in [str, bytes], "AUTHKEY must be either string or byte string."
82 | assert type(nprocs) == int, "'nprocs' must be an integer."
83 |
84 | self.IP = IP
85 | self.PORT = PORT
86 | self.AUTHKEY = AUTHKEY.encode() if type(AUTHKEY) == str else AUTHKEY
87 | N_local_cores = multiprocessing.cpu_count()
88 | if nprocs > N_local_cores:
89 | print("[WARNING] nprocs specified is more than the # of cores of this node. Using the # of cores ({}) instead.".format(N_local_cores))
90 | self.nprocs = N_local_cores
91 | elif nprocs < 1:
92 | print("[WARNING] nprocs specified is not valid. Using the # of cores ({}) instead.".format(N_local_cores))
93 | self.nprocs = N_local_cores
94 | else:
95 | self.nprocs = nprocs
96 | self.connected = False
97 | self.worker_hostname = getfqdn()
98 | self.quiet = quiet
99 | self.working_status = multiprocessing.Value("i", 0) # if the node is working on any work loads
100 |
101 | def connect(self):
102 | """
103 | Connect to Master Node after the Worker Node is initialized.
104 | """
105 | class ServerQueueManager(SyncManager):
106 | pass
107 |
108 | ServerQueueManager.register('get_job_q')
109 | ServerQueueManager.register('get_result_q')
110 | ServerQueueManager.register('get_error_q')
111 | ServerQueueManager.register('get_envir')
112 | ServerQueueManager.register('target_function')
113 | ServerQueueManager.register('queue_of_worker_list')
114 | ServerQueueManager.register('dict_of_job_history')
115 |
116 | self.manager = ServerQueueManager(address=(self.IP, self.PORT), authkey=self.AUTHKEY)
117 |
118 | try:
119 | if not self.quiet:
120 | print('[{}] Building connection to {}:{}'.format(str(datetime.datetime.now()), self.IP, self.PORT))
121 | self.manager.connect()
122 | if not self.quiet:
123 | print('[{}] Client connected to {}:{}'.format(str(datetime.datetime.now()), self.IP, self.PORT))
124 | self.connected = True
125 | self.job_q = self.manager.get_job_q()
126 | self.result_q = self.manager.get_result_q()
127 | self.error_q = self.manager.get_error_q()
128 | self.envir_to_use = self.manager.get_envir()
129 | self.target_func = self.manager.target_function()
130 | self.queue_of_worker_list = self.manager.queue_of_worker_list()
131 | self.dict_of_job_history = self.manager.dict_of_job_history()
132 | except:
133 | print("[ERROR] No connection could be made. Please check the network or your configuration.")
134 |
135 | def join_cluster(self):
136 | """
137 | This method will connect the worker node with the master node, and start to listen to the master node for any job assignment.
138 | """
139 |
140 | self.connect()
141 |
142 | if self.connected:
143 |
144 | # start the `heartbeat` process so that the master node can always know if this node is still connected.
145 | self.heartbeat_process = multiprocessing.Process(target = heartbeat, args = (self.queue_of_worker_list, self.worker_hostname, self.nprocs, self.working_status,))
146 | self.heartbeat_process.start()
147 |
148 | if not self.quiet:
149 | print('[{}] Listening to Master node {}:{}'.format(str(datetime.datetime.now()), self.IP, self.PORT))
150 |
151 | while True:
152 |
153 | try:
154 | if_job_q_empty = self.job_q.empty()
155 | except EOFError:
156 | print("[{}] Lost connection with Master node.".format(str(datetime.datetime.now())))
157 | sys.exit(1)
158 |
159 | if not if_job_q_empty and self.error_q.empty():
160 |
161 | print("[{}] Started working on some tasks.".format(str(datetime.datetime.now())))
162 |
163 |
164 | # load environment setup
165 | try:
166 | envir = self.envir_to_use.get(timeout = 3)
167 | self.envir_to_use.put(envir)
168 | except:
169 | sys.exit("[ERROR] Failed to get the environment statement from Master node.")
170 |
171 | # load task function
172 | try:
173 | target_func = self.target_func.get(timeout = 3)
174 | self.target_func.put(target_func)
175 | except:
176 | sys.exit("[ERROR] Failed to get the task function from Master node.")
177 |
178 | self.working_status.value = 1
179 | mp_apply(envir, target_func, self.job_q, self.result_q, self.error_q, self.dict_of_job_history, self.worker_hostname, self.nprocs)
180 | print("[{}] Tasks finished.".format(str(datetime.datetime.now())))
181 | self.working_status.value = 0
182 |
183 | time.sleep(0.1) # avoid too frequent communication which is unnecessary
184 |
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [bdist_wheel]
2 | universal=1
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | from os import path
3 |
4 | here = path.abspath(path.dirname(__file__))
5 |
6 | # Get the long description from the README file
7 | with open(path.join(here, 'README.rst')) as f:
8 | long_description = f.read()
9 |
10 |
11 | setup(
12 | name="minimalcluster",
13 | version="0.1.0.dev14",
14 | description='A minimal cluster computing framework',
15 | long_description=long_description,
16 | url='https://github.com/XD-DENG/minimalcluster-py',
17 | author='Xiaodong DENG',
18 | author_email='xd.deng.r@gmail.com',
19 | license='MIT',
20 | keywords='parallel cluster multiprocessing',
21 | packages=["minimalcluster"]
22 | )
23 |
--------------------------------------------------------------------------------