├── .github
│   └── CODEOWNERS
├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── docs
│   ├── configuration.md
│   ├── images
│   │   ├── hub-stress-test-health.png
│   │   ├── hub-stress-test-request-response-times.png
│   │   ├── hub-stress-test-resource-usage.png
│   │   └── py-spy-example.svg
│   ├── profiling.md
│   └── stress-test.md
├── requirements.txt
├── scripts
│   └── hub-stress-test.py
├── test-requirements.txt
└── tox.ini
/.github/CODEOWNERS:
--------------------------------------------------------------------------------
1 | * @mriedem @rmoe
2 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | purge.log
2 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 |
3 | python: 3.7
4 |
5 | cache: pip
6 |
7 | install: pip install tox
8 |
9 | script: tox
10 |
11 | jobs:
12 | include:
13 | - stage: test
14 | env: TOXENV=flake8
15 | - stage: test
16 | env: TOXENV=hub-stress-test
17 |
18 | notifications:
19 | email:
20 | on_success: never
21 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # jupyter-tools
2 |
3 | Collection of tools for working with JupyterHub and notebooks.
4 |
5 | ## Load Testing
6 |
7 | In order to support high load events we have tooling to run stress tests on our JupyterHub deployment.
8 |
 9 | * [hub-stress-test](scripts/hub-stress-test.py): This script scales up hundreds of fake users and notebook
10 | servers (pods) at once against a target JupyterHub cluster to see how it responds to sudden load, such as
11 | many users signing on at the beginning of an event. It can also scale up and hold a steady state of many
12 | users to profile the performance of the hub. See [Hub Stress Testing](docs/stress-test.md) for more details.
13 |
14 | ## Configuration tuning
15 | There are various configuration settings you can modify to improve both steady-state and scale-up
16 | performance. See [Configuration settings](docs/configuration.md) for more details.
17 |
18 | ## Profiling
19 | Performance data can be collected during normal operations or a stress-test run. See
20 | [Profiling](docs/profiling.md) for more details.
21 |
--------------------------------------------------------------------------------
/docs/configuration.md:
--------------------------------------------------------------------------------
1 | # Configuration settings
2 |
3 | This document provides an overview of configuration settings for increasing hub performance.
4 |
5 | 1. [Culler settings](#culler)
6 | 1. [Frequency](#culler-frequency)
7 | 2. [Concurrency limit](#culler-concurrency)
8 | 3. [Timeout](#culler-timeout)
9 | 4. [Notebook culler](#notebook-culler)
10 | 2. [Activity intervals](#activity)
11 | 1. [`activity_resolution`](#activity-resolution)
12 | 2. [`last_activity_interval`](#last-activity-interval)
13 |     3. [`JUPYTERHUB_ACTIVITY_INTERVAL`](#hub-activity-interval)
14 | 3. [Startup time](#startup)
15 | 1. [`init_spawners_timeout`](#spawners-timeout)
16 | 4. [Other settings](#other)
17 | 1. [`k8s_threadpool_api_workers`](#kubespawner-thread)
18 | 2. [Disable events](#kubespawner-events)
19 | 3. [Disable consecutiveFailureLimit](#disable-consecutivefailurelimit)
20 | 4. [Increase http_timeout](#increase-http-timeout)
21 | 5. [References](#references)
22 |
23 |
24 |
25 | ## Culler settings
26 | There are two mechanisms for controlling the culling of servers and users. One is a
27 | process managed by the hub which will periodically cull users and servers. The other
28 | is a setting which will allow servers to delete themselves after a period of inactivity.
29 |
30 |
31 | ### Frequency
32 | By default the culler runs every 10 minutes. With a more aggressive setting for the notebook
33 | idle timeout the hub-managed culler can be run less frequently.
34 |
35 |
36 | ### Concurrency limit
37 | By default the culler has a concurrency limit of 10. This means it will make up to 10
38 | concurrent API calls. When deleting a large number of users that can generate a high load
39 | on the hub. Setting this to `1` helps to reduce load on the hub.
40 |
41 |
42 | ### Timeout
43 | The timeout controls how long a server can be idle before being deleted. Because the servers
44 | will aggressively cull themselves this value can be set very high.
45 |
46 | These can be all configured in the `cull` section of [values.yaml](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/master/jupyterhub/values.yaml):
47 | ```yaml
48 | cull:
49 | timeout: 432000 # 5 days
50 | every: 3600 # Run once an hour instead of every 10 minutes
51 | concurrency: 1
52 | ```
53 |
54 |
55 | ### Notebook culler
56 | There are two settings which control how the notebooks cull themselves. The first is
57 | `c.NotebookApp.shutdown_no_activity_timeout` which specifies the period of inactivity
58 | (in seconds) before a server is shut down. The second is `c.MappingKernelManager.cull_idle_timeout`,
59 | which determines when idle kernels will be shut down. These settings can be configured as described
60 | [here](https://jupyter-notebook.readthedocs.io/en/stable/config_overview.html).
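
As a concrete sketch, both settings could go in the notebook image's `jupyter_notebook_config.py`; the values (and the added `cull_interval` check period) are illustrative, not recommendations:

```python
# jupyter_notebook_config.py (illustrative values)
# Shut the whole server down after 30 minutes with no activity.
c.NotebookApp.shutdown_no_activity_timeout = 1800
# Cull idle kernels after 10 minutes, checking once a minute.
c.MappingKernelManager.cull_idle_timeout = 600
c.MappingKernelManager.cull_interval = 60
```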
61 |
62 |
63 | ## Activity intervals
64 | These settings control how spawner and user activity is tracked. These settings have
65 | a large impact on the performance of the hub.
66 |
67 |
68 | ### `c.JupyterHub.activity_resolution`
69 | Activity resolution controls how often activity updates are written to the database. Many
70 | API calls will record activity for a user. This setting determines whether that update
71 | is written to the database: if the stored activity timestamp was updated less than
72 | `activity_resolution` seconds ago, the new update is skipped. Increasing this value reduces commits to the database.
73 |
74 | ```yaml
75 | extraConfig:
76 | myConfig: |
77 | c.JupyterHub.activity_resolution = 6000
78 | ```
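
The gating described above can be sketched like this (a simplified illustration of the idea, not the hub's actual implementation):

```python
from datetime import datetime, timedelta

ACTIVITY_RESOLUTION = 6000  # seconds, matching the config example above

def should_write_activity(stored, new):
    """Skip the database write unless the new activity timestamp is more
    than ACTIVITY_RESOLUTION seconds newer than the stored one."""
    if stored is None:
        return True
    return (new - stored) > timedelta(seconds=ACTIVITY_RESOLUTION)

t0 = datetime(2021, 1, 1, 12, 0, 0)
print(should_write_activity(t0, t0 + timedelta(seconds=60)))    # False: within resolution
print(should_write_activity(t0, t0 + timedelta(seconds=7000)))  # True: write it
```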
79 |
80 |
81 | ### `c.JupyterHub.last_activity_interval`
82 | This setting controls how often a periodic task in the hub named `update_last_activity`
83 | runs. This task updates user activity using information from the proxy. This task makes
84 | a large number of database calls and can put a fairly significant load on the hub. Zero to
85 | JupyterHub sets this to 1 minute by default. The upstream default of 5 minutes is a better
86 | setting.
87 |
88 | ```yaml
89 | extraConfig:
90 | myConfig: |
91 | c.JupyterHub.last_activity_interval = 300
92 | ```
93 |
94 |
95 | ### `JUPYTERHUB_ACTIVITY_INTERVAL`
96 | This controls how often each server reports its activity back to the hub. The default
97 | is 5 minutes and with hundreds or thousands of users posting activity updates it puts
98 | a heavy load on the hub and the hub's database. Increasing this to one hour or more
99 | reduces the load placed on the hub by these activity updates.
100 |
101 | ```yaml
102 | singleuser:
103 | extraEnv:
104 | JUPYTERHUB_ACTIVITY_INTERVAL: "3600"
105 | ```
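
To see why this matters, a back-of-the-envelope calculation (illustrative numbers): with 3000 active servers each posting an activity update every `JUPYTERHUB_ACTIVITY_INTERVAL` seconds, the average request rate hitting the hub is:

```python
def activity_requests_per_second(servers, interval_seconds):
    """Average rate of activity POSTs arriving at the hub."""
    return servers / interval_seconds

print(activity_requests_per_second(3000, 300))   # 10.0 req/s at the 5-minute default
print(activity_requests_per_second(3000, 3600))  # ~0.83 req/s at one hour
```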
106 |
107 |
108 | ## Startup time
109 |
110 |
111 | ### `init_spawners_timeout`
112 | [c.JupyterHub.init_spawners_timeout](https://jupyterhub.readthedocs.io/en/stable/api/app.html#jupyterhub.app.JupyterHub.init_spawners_timeout) controls how long the hub will wait for spawners to
113 | initialize. When this timeout is reached the spawner check will go into the background and
114 | hub startup will continue. With many hundreds or thousands of spawners this is always going
115 | to exceed any reasonable timeout so there's no reason to wait at all. Setting it to `1`
116 | (which is the minimum value) allows the hub to start faster and start servicing other requests.
117 |
118 | In `values.yaml`:
119 | ```yaml
120 | extraConfig:
121 | myConfig: |
122 | c.JupyterHub.init_spawners_timeout = 1
123 | ```
124 |
125 |
126 | ## Other settings
127 | Other settings which are helpful for tuning performance.
128 |
129 |
130 | ### `c.KubeSpawner.k8s_api_threadpool_workers`
131 | This value controls the number of threads `kubespawner` will create to make API calls to
132 | Kubernetes. The default is `5 * num_cpus`. Given a large enough number of users logging in
133 | and spawning servers at the same time this may not be enough threads. A more sensible value
134 | for this setting is [c.JupyterHub.concurrent_spawn_limit](https://jupyterhub.readthedocs.io/en/stable/api/app.html#jupyterhub.app.JupyterHub.concurrent_spawn_limit).
135 | `concurrent_spawn_limit` controls how many users can spawn servers at the same time.
136 | By creating that many threadpool workers we ensure that there's always a thread available
137 | to service a user's spawn request. The upstream default for `concurrent_spawn_limit` is 100 while
138 | the default with Zero to JupyterHub is 64.
139 |
140 | In `values.yaml`:
141 | ```yaml
142 | extraConfig:
143 | perfConfig: |
144 | c.KubeSpawner.k8s_api_threadpool_workers = c.JupyterHub.concurrent_spawn_limit
145 | ```
146 |
147 |
148 | ### Disable user events
149 | With this enabled `kubespawner` will process events from the Kubernetes API which are then
150 | used to show progress on the user spawn page. Disabling this reduces the load on `kubespawner`.
151 |
152 | To disable user events update the `events` key in the `values.yaml` file. This value ultimately
153 | sets `c.KubeSpawner.events_enabled`.
154 |
155 | ```yaml
156 | singleuser:
157 | events: false
158 | ```
159 |
160 |
161 | ### Disable consecutiveFailureLimit
162 | JupyterHub itself defaults [c.Spawner.consecutive_failure_limit](https://jupyterhub.readthedocs.io/en/stable/api/spawner.html#jupyterhub.spawner.Spawner.consecutive_failure_limit) to 0 to disable it but zero-to-jupyterhub-k8s
163 | defaults it to [5](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/0.11.0/jupyterhub/values.yaml#L43).
164 | This can be problematic at the start of a large user event when many users are starting server pods at the same
165 | time: if user node capacity is exhausted, spawns can time out while waiting for the node auto-scaler to add more
166 | capacity. Once the consecutive failure limit is reached the hub restarts, which is unlikely to help when the spawn
167 | timeouts are caused by capacity issues in the first place.
168 |
169 | To disable the consecutive failure limit update the `consecutiveFailureLimit` key in the `values.yaml` file.
170 |
171 | ```yaml
172 | hub:
173 | consecutiveFailureLimit: 0
174 | ```
175 |
176 |
177 | ### Increase http_timeout
178 |
179 | [`c.KubeSpawner.http_timeout`](https://jupyterhub.readthedocs.io/en/stable/api/spawner.html#jupyterhub.spawner.Spawner.http_timeout)
180 | defaults to 30 seconds. During scale and load testing we have seen the hub hit this timeout and delete a server
181 | pod that would have come up if given just a few seconds more. If you have enough node capacity that pods are
182 | being created, but they are slow to come up and are hitting this timeout, consider increasing it to something
183 | like 60 seconds. Startup time also varies with whether you are using `notebook` or `jupyterlab` / `jupyter-server`,
184 | the type of backing storage for the user pods (e.g. s3fs shared object storage is known to be slower), and how
185 | many and what kinds of extensions you have in the user image.
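
For example, following the same `extraConfig` pattern as the other settings in this document (the value is illustrative):

```yaml
extraConfig:
  timeoutConfig: |
    c.KubeSpawner.http_timeout = 60
```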
187 |
188 |
189 | ## References
190 | - https://discourse.jupyter.org/t/confusion-of-the-db-instance/3878
191 | - https://discourse.jupyter.org/t/identifying-jupyterhub-api-performance-bottleneck/1289
192 | - https://discourse.jupyter.org/t/minimum-specs-for-jupyterhub-infrastructure-vms/5309
193 | - https://discourse.jupyter.org/t/background-for-jupyterhub-kubernetes-cost-calculations/5289
194 | - https://discourse.jupyter.org/t/core-component-resilience-reliability/5433
195 | - https://discourse.jupyter.org/t/scheduler-insufficient-memory-waiting-errors-any-suggestions/5314
196 |
--------------------------------------------------------------------------------
/docs/images/hub-stress-test-health.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/jupyter-tools/747256d3a994ab36abfeba501f14cc05307facd9/docs/images/hub-stress-test-health.png
--------------------------------------------------------------------------------
/docs/images/hub-stress-test-request-response-times.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/jupyter-tools/747256d3a994ab36abfeba501f14cc05307facd9/docs/images/hub-stress-test-request-response-times.png
--------------------------------------------------------------------------------
/docs/images/hub-stress-test-resource-usage.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/IBM/jupyter-tools/747256d3a994ab36abfeba501f14cc05307facd9/docs/images/hub-stress-test-resource-usage.png
--------------------------------------------------------------------------------
/docs/images/py-spy-example.svg:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/docs/profiling.md:
--------------------------------------------------------------------------------
1 | # Profiling
 2 | During stress testing, or even normal operations, [py-spy](https://github.com/benfred/py-spy) can be used to capture profiling data, including SVG
 3 | flamegraphs that show where the hub is spending its time.
4 |
5 | 1. [Installation](#py-spy-installation)
6 | 1. [Collecting data](#py-spy-collecting-data)
7 | 1. [`py-spy top`](#py-spy-top)
8 | 1. [`py-spy record`](#py-spy-record)
9 | 1. [`py-spy dump`](#py-spy-dump)
10 |
11 |
12 | ### Installation
13 | `py-spy` is installed by z2jh but it won't work without additional configuration. The hub image
14 | must be modified to set the `SYS_PTRACE` capability on the `py-spy` binary. The following line
15 | should be added after `py-spy` has been installed and before root privileges are dropped during
16 | the image build.
17 |
18 | ```
19 | RUN setcap cap_sys_ptrace+ep $(which py-spy)
20 | ```
21 |
22 | Additionally the `securityContext` must be configured in the hub deployment. This is done
23 | by setting `containerSecurityContext` in `values.yaml`.
24 |
25 | ```yaml
26 | containerSecurityContext:
27 | allowPrivilegeEscalation: true
28 | capabilities:
29 | drop:
30 | - all
31 | add:
32 | - SYS_PTRACE
33 | ```
34 |
35 |
36 | ### Collecting data
37 | `py-spy` must be run from inside the hub container.
38 |
39 | ```bash
40 | $ kubectl -n <namespace> exec -it <hub-pod> -- bash
41 | ```
42 |
43 | There are three ways to investigate the hub's activity.
44 |
45 |
46 | 1. `py-spy top`
47 | This shows a live view of which functions are taking the most time. It's similar to the
48 | Linux `top` command.
49 | ```
50 | jovyan@hub-69f94ddc84-g26qj:/$ py-spy top -p 1
51 |
52 | Collecting samples from '/usr/bin/python3 /usr/local/bin/jupyterhub --config /etc/jupyterhub/jupyterhub_config.py --upgrade-db' (python v3.6.9)
53 | Total Samples 1118
54 | GIL: 0.00%, Active: 0.00%, Threads: 1
55 |
56 | %Own %Total OwnTime TotalTime Function (filename:line)
57 | 0.00% 0.00% 0.010s 0.010s add_timeout (tornado/ioloop.py:580)
58 |   0.00%   0.00%    0.000s     0.010s   <module> (jupyterhub:11)
59 | 0.00% 0.00% 0.000s 0.010s start (tornado/platform/asyncio.py:149)
60 | 0.00% 0.00% 0.000s 0.010s _run_callback (tornado/ioloop.py:743)
61 | 0.00% 0.00% 0.000s 0.010s _schedule_next (tornado/ioloop.py:916)
62 | 0.00% 0.00% 0.000s 0.010s launch_instance (jupyterhub/app.py:2782)
63 | 0.00% 0.00% 0.000s 0.010s run_forever (asyncio/base_events.py:438)
64 | 0.00% 0.00% 0.000s 0.010s _run (tornado/ioloop.py:911)
65 | 0.00% 0.00% 0.000s 0.010s _run_once (asyncio/base_events.py:1451)
66 | 0.00% 0.00% 0.000s 0.010s _run (asyncio/events.py:145)
67 | ```
68 | The output can be sorted by each column as well.
69 |
70 | 1. `py-spy record`
71 | The record command runs in the foreground collecting samples and when it's closed (either
72 | by CTRL+C or when it reaches its configured duration) an SVG flamegraph is written to disk.
73 | ```
74 | jovyan@hub-69f94ddc84-g26qj:/$ py-spy record -o /tmp/py-spy-trace -p 1
75 | py-spy> Sampling process 100 times a second. Press Control-C to exit.
76 |
77 | ^C
78 | py-spy> Stopped sampling because Control-C pressed
79 | py-spy> Wrote flamegraph data to '/tmp/py-spy-trace'. Samples: 20541 Errors: 0
80 | ```
81 | That will produce an SVG like this:
82 | ![py-spy flamegraph example](images/py-spy-example.svg)
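
For unattended captures it can help to bound the run instead of pressing Ctrl+C; a sketch using py-spy's `--duration` flag (seconds to sample before writing the SVG):

```console
jovyan@hub-69f94ddc84-g26qj:/$ py-spy record -o /tmp/py-spy-60s.svg -p 1 --duration 60
```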
83 |
84 | 1. `py-spy dump`
85 | The dump command will dump the state of all threads for the specified process. It can
86 | optionally show the local variables for each frame. This is helpful, for example, for
87 | figuring out why the hub process appears stuck.
88 |
89 | ```
90 | jovyan@hub-69f94ddc84-g26qj:/$ py-spy dump -p 1 --locals
91 | Process 1: /usr/bin/python3 /usr/local/bin/jupyterhub --config /etc/jupyterhub/jupyterhub_config.py --upgrade-db
92 | Python v3.6.9 (/usr/bin/python3.6)
93 |
94 | Thread 1 (idle): "MainThread"
95 | select (selectors.py:445)
96 | Arguments::
97 | self:
98 | timeout: 0.998
99 | Locals::
100 | max_ev: 3
101 | ready: []
102 | _run_once (asyncio/base_events.py:1415)
103 | Arguments::
104 | self: <_UnixSelectorEventLoop at 0x7f79ac4ab7b8>
105 | Locals::
106 | sched_count: 45
107 | timeout: 0.9979082886129618
108 | when: 5094106.57817052
109 | run_forever (asyncio/base_events.py:438)
110 | Arguments::
111 | self: <_UnixSelectorEventLoop at 0x7f79ac4ab7b8>
112 | Locals::
113 | old_agen_hooks: (None, None)
114 | start (tornado/platform/asyncio.py:149)
115 | Arguments::
116 | self:
117 | Locals::
118 | old_loop: <_UnixSelectorEventLoop at 0x7f79ac4ab7b8>
119 | launch_instance (jupyterhub/app.py:2782)
120 | Arguments::
121 | cls:
122 | argv: None
123 | Locals::
124 | self:
125 | loop:
126 | task: <_asyncio.Task at 0x7f79abd788c8>
127 |     <module> (jupyterhub:11)
128 | ```
129 |
--------------------------------------------------------------------------------
/docs/stress-test.md:
--------------------------------------------------------------------------------
1 | # Hub Stress Testing
2 |
3 | This document gives an overview of the [hub-stress-test script](../scripts/hub-stress-test.py)
4 | and how it can be used.
5 |
6 | 1. [Setup](#setup)
7 | 1. [Scaling up](#scaling-up)
8 | 1. [Placeholders and user nodes](#placeholders)
9 | 1. [Steady state testing](#steady-state)
10 | 1. [Activity update testing](#activity-update-testing)
11 | 1. [Scaling down](#scaling-down)
12 | 1. [Monitoring](#monitoring)
13 |
14 |
15 | ## Setup
16 |
17 | You will need two things to run the script: an admin token and a target hub API endpoint URL.
18 |
19 | The admin token can be provided to the script on the command line, but it's recommended to create a
20 | file you can source to export the `JUPYTERHUB_API_TOKEN` environment variable.
21 |
22 | For the hub API endpoint URL, you can probably use the same value as the `JUPYTERHUB_API_URL`
23 | environment variable in your user notebooks, e.g. `https://myhub-testing.acme.com/hub/api`.
24 |
25 | Putting these together, you can have a script like the following to prepare your environment:
26 |
27 | ```bash
28 | #!/bin/bash -e
29 | export JUPYTERHUB_API_TOKEN=abcdef123456
30 | export JUPYTERHUB_ENDPOINT=https://myhub-testing.acme.com/hub/api
31 | ```
32 |
33 |
34 | ## Scaling up
35 |
36 | By default the `stress-test` command of the `hub-stress-test` script will scale up to
37 | 100 users and notebook servers (pods) in batches, wait for them to be "ready" and then
38 | stop and delete them.
39 |
40 |
41 | ### Placeholders and user nodes
42 |
43 | The number of pods that can be created in any given run depends on the number of
44 | `user-placeholder` pods already in the cluster and the number of `user` nodes. The
45 | `user-placeholder` pods are pre-emptible pods which are part of a StatefulSet:
46 |
47 | ```console
48 | $ kubectl get statefulset/user-placeholder -n jhub
49 | NAME READY AGE
50 | user-placeholder 300/300 118d
51 | ```
52 |
53 | We normally have very few of these in our testing cluster but need to
54 | scale them up when doing stress testing; otherwise the `hub-stress-test` script has to wait
55 | for the auto-scaler to add more nodes to the `user` worker pool. The number of available
56 | workers can be found like so:
57 |
58 | ```console
59 | $ kubectl get nodes -l workerPurpose=users | grep -c "Ready\s"
60 | 13
61 | ```
62 |
63 | The number of `user` nodes needed for a scale test depends on the resource requirements
64 | of the user notebook pods, reserved space on the nodes, other system pods running on the nodes
65 | (e.g. a logging daemon), per-node pod limits, etc.
66 |
67 | If there are not enough nodes available and the auto-scaler has to create them
68 | as the stress test is running, we can hit the [consecutive failure limit](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/363d0b7db5/jupyterhub/values.yaml#L17) which will cause the hub container to crash and restart.
69 | One way to avoid this is to run the script with a `--count` no higher than 500, which
70 | gives time between runs for the auto-scaler to add more `user` nodes.
71 |
72 | As an example, the `kubelet` default `maxPods` limit is 110 per node, and on IBM Cloud there are about
73 | 25 system pods per node. The user notebooks in our testing cluster use a micro
74 | profile, so their resource usage is not an issue; they are bounded only by the 110 pods-per-node limit.
75 | As a reference, to scale up to 3000 users/pods we need to have at least 35 user nodes.
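
That arithmetic can be wrapped in a quick sizing helper. The defaults below come from the `kubelet`/IBM Cloud numbers in this paragraph, and the function name is illustrative; it deliberately ignores placeholder pods and per-pod resource requests:

```python
import math

def user_nodes_needed(target_pods, max_pods_per_node=110, system_pods_per_node=25):
    """Estimate how many `user` nodes a scale test needs based purely on
    per-node pod limits, ignoring placeholders and resource requests."""
    capacity = max_pods_per_node - system_pods_per_node  # pods left for users
    return math.ceil(target_pods / capacity)

print(user_nodes_needed(3000))
```

With the 110/25 defaults this yields 36 nodes for 3000 pods, in line with the "at least 35" figure above.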
76 |
77 |
78 | ### Steady state testing
79 |
80 | The `--keep` option can be used to scale up the number of pods in the cluster and retain them
81 | so that you can perform tests or profiling on the hub under high load. When the script runs
82 | it first checks the number of existing `hub-stress-test` users and then creates new
83 | users starting from that index, so you can run the script repeatedly with a `--count` value of 200-500
84 | if you need to let the auto-scaler add `user` nodes after each run.
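
New users are named by a monotonically increasing index appended to a fixed prefix, which is how repeated `--keep` runs build on each other. A sketch of the naming logic from `create_users` (`next_usernames` is an illustrative wrapper):

```python
USERNAME_PREFIX = 'hub-stress-test'

def next_usernames(num_existing, count):
    """Generate the next `count` usernames after `num_existing` users,
    the way repeated --keep runs continue from the existing set."""
    start = num_existing + 1
    return ['%s-%d' % (USERNAME_PREFIX, i) for i in range(start, start + count)]

print(next_usernames(200, 3))
# ['hub-stress-test-201', 'hub-stress-test-202', 'hub-stress-test-203']
```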
85 |
86 | Note that the `c.NotebookApp.shutdown_no_activity_timeout` value in the user notebook image (in the
87 | testing cluster) should either be left at the default (0) or set to some larger window so that the
88 | notebook pods do not shut themselves down while you are scaling up.
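
For example, in the notebook image's `jupyter_notebook_config.py` (a config fragment; this assumes the classic notebook server, which is what exposes `c.NotebookApp`):

```python
# Leave idle auto-shutdown disabled (0 is the default) while stress testing,
# or set a window longer than the planned scale-up, e.g. 12 hours:
c.NotebookApp.shutdown_no_activity_timeout = 0
# c.NotebookApp.shutdown_no_activity_timeout = 12 * 60 * 60
```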
89 |
90 |
91 | ### Activity update testing
92 |
93 | The `activity-stress-test` command can be used to simulate `--count` users POSTing activity
94 | updates. This command only creates users, not servers. It takes a number of users to simulate
95 | specified by `--count` and a number of worker threads, `--workers`, to perform the actual
96 | requests. If `--keep` isn't specified then the users will be deleted after the test.
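
The users are split across worker threads with a simple slicing generator (this mirrors the `chunk` helper in the script); when `--count` does not divide evenly the remainder becomes an extra batch that the thread pool simply queues:

```python
def chunk(users, n):
    # Yield successive n-sized slices of the user list; the last may be shorter.
    for i in range(0, len(users), n):
        yield users[i:i + n]

usernames = ['hub-stress-test-%d' % i for i in range(1, 11)]  # --count 10
workers = 3
batches = list(chunk(usernames, len(usernames) // workers))
print([len(b) for b in batches])  # [3, 3, 3, 1]
```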
97 |
98 |
99 | ## Scaling down
100 |
101 | If you used the `--keep` option to scale up and retain pods for steady state testing, when you are
102 | done you can scale down the pods and users by using the `purge` command. The users created by the
103 | script all follow a specific naming convention, so the script knows which notebook servers to stop
104 | and which users to remove.
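
That convention is the fixed `hub-stress-test-` prefix on every username, so `purge` can safely filter the full `GET /users` listing down to only the users it created, as in `find_existing_stress_test_users`:

```python
USERNAME_PREFIX = 'hub-stress-test'

def stress_test_users(all_users):
    """Filter a GET /users listing down to the users created by the script."""
    return [u for u in all_users if u['name'].startswith(USERNAME_PREFIX)]

listing = [{'name': 'alice'}, {'name': 'hub-stress-test-1'}, {'name': 'hub-stress-test-2'}]
print([u['name'] for u in stress_test_users(listing)])
# ['hub-stress-test-1', 'hub-stress-test-2']
```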
105 |
106 |
107 | ## Monitoring
108 |
109 | Depending on the number of pods being created or deleted the script can take a while. During a run
110 | you should watch the hub logs as well as a few dashboards. The logging and monitoring
111 | platform is deployment-specific but the following are some examples of dashboards we monitor:
112 |
113 | * `Jupyter Notebook Health (Testing)`
114 | This dashboard shows the active user notebook pods, nodes in the cluster and `user-placeholder`
115 | pods. It is mostly useful for watching the active user notebook pod count rise and fall as the script
116 | scales up or down. The placeholder and user node counts may also fluctuate as placeholder pods
117 | are pre-empted and as the auto-scaler adds or removes user nodes.
118 |
119 | 
120 |
121 | * `Jupyter Hub Golden Signals (testing)`
122 | This is where you can monitor the response time and request rate on the hub. As user notebook pods
123 | are scaled up each of those pods will "check in" with the hub to report their activity. By default
124 | each pod checks in with the hub every [5 minutes](https://github.com/jupyterhub/jupyterhub/blob/5dee864af/jupyterhub/singleuser.py#L463). So we expect that the more active user notebook pods are in the cluster, the higher
125 | the request rate and response times in this dashboard will be. The error rates may also increase
126 | as we get 429 responses from the hub while scaling up due to the [concurrentSpawnLimit](https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/363d0b7db/jupyterhub/values.yaml#L16). Those 429 responses are expected
127 | and the `hub-stress-test` script is built to retry on them. Here is an example of a 3000 user load
128 | run:
129 |
130 | 
131 |
132 | That run started around 2:30 and then the purge started around 9 which is why response times track
133 | the increase in request rates. As the purge runs the number of pods reporting activity is going down
134 | so the request rate also goes down. One thing to note on the purge is that the [slow_stop_timeout](https://github.com/jupyterhub/jupyterhub/blob/42adb4415/jupyterhub/handlers/base.py#L761) defaults to 10 seconds, so as
135 | we are stopping user notebook servers (deleting pods) the response times spike because of that
136 | arbitrary 10 second delay in the hub API.
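
Each of those check-ins is a `POST /users/{name}/activity` request; the body the stress-test script builds when simulating one looks like this (`activity_body` is an illustrative wrapper around the script's `send_activity` logic):

```python
import json
from datetime import datetime, timedelta

def activity_body(now=None):
    """Build the payload POSTed to /users/{name}/activity; the shape is
    taken from the hub-stress-test script's send_activity helper."""
    ts = (now or datetime.utcnow() + timedelta(minutes=1)).isoformat()
    return {"servers": {"": {"last_activity": ts}}, "last_activity": ts}

body = activity_body(datetime(2020, 1, 1, 12, 0, 0))
print(json.dumps(body, indent=2))
```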
137 |
138 | Other useful panels on this dashboard are for tracking CPU and memory usage of the hub. From the same
139 | 3000 user run as above:
140 |
141 | 
142 |
143 | CPU, memory and network I/O increase as the number of user notebook pods grows and those pods report
144 | activity to the hub. CPU and network I/O drop when the purge starts running. Note that
145 | memory usage remains high even after the purge starts because the hub aggressively caches DB state in
146 | memory and apparently does not clean up the cached references even after spawners and users are deleted
147 | from the database.
148 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | urllib3
3 |
--------------------------------------------------------------------------------
/scripts/hub-stress-test.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import argparse
4 | from concurrent import futures
5 | from datetime import datetime, timedelta
6 | import functools
7 | import json
8 | import logging
9 | from unittest import mock
10 | import os
11 | import random
12 | import sys
13 | import time
14 |
15 | import requests
16 | from requests import adapters
17 | from urllib3.util import retry
18 |
19 |
20 | LOG_FORMAT = "%(asctime)s %(levelname)s [%(name)s] %(message)s"
21 | LOG = logging.getLogger('hub-stress-test')
22 |
23 | # POST /users/{name}/servers can take over 10 seconds so be conservative with
24 | # the default timeout value.
25 | DEFAULT_TIMEOUT = 30
26 |
27 | # The default timeout for waiting on a server status change (starting/stopping)
28 | SERVER_LIFECYCLE_TIMEOUT = 60
29 |
30 | USERNAME_PREFIX = 'hub-stress-test'
31 |
32 |
33 | def parse_args():
34 | # Consider splitting this into sub-commands in case you want to be able to
35 | # scale up and retain servers to do some profiling and then have another
36 | # command to scale down when done. It could also be useful to have a
37 | # sub-command to report information about the hub, e.g. current number of
38 | # users/servers and which of those were created by this tool.
39 | parser = argparse.ArgumentParser(
40 | formatter_class=argparse.RawDescriptionHelpFormatter,
41 | description='''
42 | JupyterHub Stress Test
43 |
44 | The `stress-test` command will create `--count` number of fake users and
45 | notebook servers in batches defined by the `--batch-size` option in the
46 | given JupyterHub `--endpoint`. It will wait for each notebook server to
47 | be considered "ready" by the hub. By default the created users and servers
48 | will be deleted but the `--keep` option can be used to retain the resources
49 | for steady-state profiling. The `purge` command is available to delete any
50 | previously kept users/servers.
51 |
52 | The `activity-stress-test` command simulates user activity updates. This
53 | will create `--count` fake users with no server. These users will be
54 | deleted unless `--keep` is specified. A number of threads specified by
55 | `--workers` will be created to send updates to the hub. While these worker
56 | threads are sending activity another thread makes requests to the API and
57 | reports on the average, minimum, and maximum time of that API call.
58 |
59 | An admin API token is required and may be specified using the
60 | JUPYTERHUB_API_TOKEN environment variable.
61 |
62 | Similarly the hub API endpoint must be provided and may be specified using the
63 | JUPYTERHUB_ENDPOINT environment variable.
64 |
65 | A `--dry-run` option is available for seeing what the test would look like
66 | without actually making any changes, for example:
67 |
68 | JUPYTERHUB_API_TOKEN=test
69 | JUPYTERHUB_ENDPOINT=http://localhost:8000/hub/api
70 | python hub-stress-test.py -v --dry-run stress-test
71 | ''')
72 | parser.add_argument('-e', '--endpoint',
73 | default=os.environ.get('JUPYTERHUB_ENDPOINT'),
74 | help='The target hub API endpoint for the stress '
75 | 'test. Can also be read from the '
76 | 'JUPYTERHUB_ENDPOINT environment variable.')
77 | parser.add_argument('-t', '--token',
78 | default=os.environ.get('JUPYTERHUB_API_TOKEN'),
79 | help='JupyterHub admin API token. Must be a token '
80 | 'for an admin user in order to create other fake '
81 | 'users for the scale test. Can also be read from '
82 | 'the JUPYTERHUB_API_TOKEN environment variable.')
83 | parser.add_argument('--dry-run', action='store_true',
84 | help='If set do not actually make API requests.')
85 | # Note that with nargs='?' if --log-to-file is specified but without an
86 | # argument value then it will be True (uses the const value) and we'll
87 | # generate a log file under /tmp. If --log-to-file is not specified at all
88 | # then it will default to False and we'll log to stdout. Otherwise if
89 | # --log-to-file is specified with a command line argument we'll log to that
90 | # file.
91 | parser.add_argument('--log-to-file', nargs='?', default=False, const=True,
92 | metavar='FILEPATH',
93 | help='If set logging will be redirected to a file. If '
94 | 'no FILEPATH value is provided then a '
95 | 'timestamp-based log file under /tmp will be '
96 | 'created. Note that if a FILEPATH value is given '
97 | 'an existing file will be overwritten.')
98 | parser.add_argument('-v', '--verbose', action='store_true',
99 | help='Enable verbose (debug) logging which includes '
100 | 'logging API response times.')
101 |
102 | # This parser holds arguments that need to be shared among two or more
103 | # subcommands but should not be top-level arguments.
104 | parent_parser = argparse.ArgumentParser(add_help=False)
105 | parent_parser.add_argument(
106 | '-k', '--keep', action='store_true',
107 | help='Retain the created fake users/servers once they all created. '
108 | 'By default the script will scale up and then teardown. The '
109 | 'script can be run with --keep multiple times to build on an '
110 | 'existing set of fake users.'
111 | )
112 | parent_parser.add_argument('-c', '--count', default=100, type=int,
113 | help='Number of users/servers (pods) to create '
114 | '(default: 100).')
115 |
116 | subparsers = parser.add_subparsers(dest='command', required=True)
117 | stress_parser = subparsers.add_parser(
118 | 'stress-test', parents=[parent_parser]
119 | )
120 | stress_parser.add_argument(
121 | '-b', '--batch-size', default=10, type=int,
122 | help='Batch size to use when creating users and notebook servers. '
123 | 'Note that by default z2jh will limit concurrent server creation '
124 | 'to 64 (see c.JupyterHub.concurrent_spawn_limit) (default: 10). '
125 | )
126 | stress_parser.add_argument(
127 | '-p', '--profile', type=str, required=False,
128 | help='Hardware profile for servers.'
129 | )
130 |
131 | activity_parser = subparsers.add_parser(
132 | 'activity-stress-test', parents=[parent_parser]
133 | )
134 | activity_parser.add_argument(
135 | '--workers', type=int, default=100,
136 | help='Number of worker threads to create. Each thread will receive '
137 | 'len(users) // workers users to send updates for.'
138 | )
139 |
140 | # Add a standalone purge subcommand
141 | subparsers.add_parser('purge')
142 |
143 | args = parser.parse_args()
144 | return args
145 |
146 |
147 | def validate(args):
148 | if args.command == 'stress-test':
149 | if args.batch_size < 1:
150 | raise Exception('--batch-size must be greater than 0')
151 | if args.count < 1:
152 | raise Exception('--count must be greater than 0')
153 | if args.token is None:
154 | raise Exception('An API token must be provided either using --token '
155 | 'or the JUPYTERHUB_API_TOKEN environment variable')
156 | if args.endpoint is None:
157 | raise Exception('A hub API endpoint URL must be provided either using '
158 | '--endpoint or the JUPYTERHUB_ENDPOINT environment '
159 | 'variable')
160 |
161 |
162 | def setup_logging(verbose=False, log_to_file=False, args=None):
163 | filename = None
164 | if log_to_file: # If --log-to-file is specified at all this is Truthy
165 | if isinstance(log_to_file, str): # A specific file is given so use it.
166 | filename = log_to_file
167 | else: # --log-to-file with no arg so generate a tmp file for logging.
168 | timestamp = datetime.utcnow().isoformat(timespec='seconds')
169 | filename = os.path.join(
170 | '/tmp', f'hub-stress-test-{timestamp}.log')
171 | print(f'Redirecting logs to: {filename}')
172 | logging.basicConfig(format=LOG_FORMAT, filename=filename, filemode='w')
173 | root_logger = logging.getLogger(None)
174 | root_logger.setLevel(logging.INFO)
175 | if verbose:
176 | root_logger.setLevel(logging.DEBUG)
177 | logging.getLogger('urllib3.connectionpool').setLevel(logging.WARNING)
178 |
179 | if log_to_file and args:
180 | # Log the args used to run the script for posterity.
181 | # Scrub the token though so we don't log it.
182 | args_dict = dict(vars(args)) # Make sure to copy the vars dict.
183 | args_dict['token'] = '***'
184 | LOG.info('Args: %s', args_dict)
185 |
186 | def log_uncaught_exceptions(exc_type, exc_value, exc_traceback):
187 | root_logger.critical("Uncaught exception",
188 | exc_info=(exc_type, exc_value, exc_traceback))
189 |
190 | sys.excepthook = log_uncaught_exceptions
191 |
192 |
193 | def timeit(f):
194 | @functools.wraps(f)
195 | def wrapper(*args, **kwargs):
196 | start_time = time.time()
197 | try:
198 | return f(*args, **kwargs)
199 | finally:
200 | LOG.info('Took %.3f seconds to %s',
201 | (time.time() - start_time), f.__name__)
202 | return wrapper
203 |
204 |
205 | def log_response_time(resp, *args, **kwargs):
206 | """Logs response time elapsed.
207 |
208 | See: https://requests.readthedocs.io/en/master/user/advanced/#event-hooks
209 |
210 | :param resp: requests.Response object
211 | :param args: ignored
212 | :param kwargs: ignored
213 | """
214 | LOG.debug('%(method)s %(url)s status:%(status)s time:%(elapsed)ss',
215 | {'method': resp.request.method,
216 | 'url': resp.url,
217 | 'status': resp.status_code,
218 | 'elapsed': resp.elapsed.total_seconds()})
219 |
220 |
221 | def get_session(token, dry_run=False, pool_maxsize=100):
222 | if dry_run:
223 | return mock.create_autospec(requests.Session)
224 | session = requests.Session()
225 | session.headers.update({'Authorization': 'token %s' % token})
226 | # Retry on errors that might be caused by stress testing.
227 | r = retry.Retry(
228 | backoff_factor=0.5,
229 | method_whitelist=False, # retry on any verb (including POST)
230 | status_forcelist={
231 | 429, # concurrent_spawn_limit returns a 429
232 | 503, # if the hub container crashes we get a 503
233 | 504, # if the cloudflare gateway times out we get a 504
234 | })
235 | adapter = adapters.HTTPAdapter(max_retries=r, pool_maxsize=pool_maxsize)
236 | session.mount("http://", adapter)
237 | session.mount("https://", adapter)
238 | if LOG.isEnabledFor(logging.DEBUG):
239 | session.hooks['response'].append(log_response_time)
240 | return session
241 |
242 |
243 | def wait_for_server_to_stop(username, endpoint, session):
244 | count = 1
245 | while count <= SERVER_LIFECYCLE_TIMEOUT:
246 | resp = session.get(endpoint + '/users/%s' % username)
247 | if resp:
248 | user = resp.json()
249 | # When the server is stopped the servers dict should be empty.
250 | if not user.get('servers') or isinstance(user, mock.Mock):
251 | return True
252 | LOG.debug('Still waiting for server for user %s to stop, '
253 | 'attempt: %d', username, count)
254 | elif resp.status_code == 404:
255 | # Was the user deleted underneath us?
256 | LOG.info('Got 404 while waiting for server for user %s to '
257 | 'stop: %s', username, resp.content)
258 | # Consider this good if the user is gone.
259 | return True
260 | else:
261 | LOG.warning('Unexpected error while waiting for server for '
262 | 'user %s to stop: %s', username, resp.content)
263 | time.sleep(1)
264 | count += 1
265 | else:
266 | LOG.warning('Timed out waiting for server for user %s to stop after '
267 | '%d seconds', username, SERVER_LIFECYCLE_TIMEOUT)
268 | return False
269 |
270 |
271 | def stop_server(username, endpoint, session, wait=False):
272 | resp = session.delete(endpoint + '/users/%s/server' % username,
273 | timeout=DEFAULT_TIMEOUT)
274 | if resp:
275 | # If we got a 204 then the server is stopped and we should not
276 | # need to poll.
277 | if resp.status_code == 204:
278 | return True
279 | if wait:
280 | return wait_for_server_to_stop(username, endpoint, session)
281 | # We're not going to wait so just return True to indicate that we
282 | # successfully sent the stop request.
283 | return True
284 | else:
285 | LOG.warning('Failed to stop server for user %s. Response status '
286 | 'code: %d. Response content: %s', username,
287 | resp.status_code, resp.content)
288 | return False
289 |
290 |
291 | @timeit
292 | def stop_servers(usernames, endpoint, session, batch_size):
293 | stopped = {} # map of username to whether or not the server was stopped
294 | LOG.debug('Stopping servers for %d users in batches of %d',
295 | len(usernames), batch_size)
296 | # Do this in batches in a ThreadPoolExecutor because the
297 | # `slow_stop_timeout` default of 10 seconds in the hub API can cause the
298 | # stop action to be somewhat synchronous.
299 | with futures.ThreadPoolExecutor(
300 | max_workers=batch_size,
301 | thread_name_prefix='hub-stress-test:stop_servers') as executor:
302 | future_to_username = {
303 | executor.submit(stop_server, username, endpoint, session): username
304 | for username in usernames
305 | }
306 | # as_completed returns an iterator which yields futures as they
307 | # complete
308 | for future in futures.as_completed(future_to_username):
309 | username = future_to_username[future]
310 | stopped[username] = future.result()
311 | return stopped
312 |
313 |
314 | @timeit
315 | def wait_for_servers_to_stop(stopped, endpoint, session):
316 | """Wait for a set of user servers to stop.
317 |
318 | :param stopped: dict of username to boolean value of whether or not the
319 | server stop request was successful because if not we don't wait for
320 | that server; if the boolean value is True then it is updated in-place
321 | with the result of whether or not the server was fully stopped
322 | :param endpoint: base endpoint URL
323 | :param session: requests.Session instance
324 | """
325 | LOG.debug('Waiting for servers to stop')
326 | for username, was_stopped in stopped.items():
327 | # Only wait if we actually successfully tried to stop it.
328 | if was_stopped:
329 | # Update our tracking flag by reference.
330 | stopped[username] = wait_for_server_to_stop(
331 | username, endpoint, session)
332 |
333 |
334 | @timeit
335 | def delete_users_after_stopping_servers(stopped, endpoint, session):
336 | """Delete users after stopping their servers.
337 |
338 | :param stopped: dict of username to boolean value of whether or not the
339 | server was successfully stopped
340 | :param endpoint: base endpoint URL
341 | :param session: requests.Session instance
342 | :returns: True if all users were successfully deleted, False otherwise
343 | """
344 | LOG.debug('Deleting users now that servers are stopped')
345 | success = True
346 | for username, was_stopped in stopped.items():
347 | resp = session.delete(endpoint + '/users/%s' % username,
348 | timeout=DEFAULT_TIMEOUT)
349 | if resp:
350 | LOG.debug('Deleted user: %s', username)
351 | elif resp.status_code == 404:
352 | LOG.debug('User already deleted: %s', username)
353 | else:
354 | LOG.warning('Failed to delete user: %s. Response status code: %d. '
355 | 'Response content: %s. Was the server stopped? %s',
356 | username, resp.status_code, resp.content, was_stopped)
357 | success = False
358 | return success
359 |
360 |
361 | @timeit
362 | def delete_users(usernames, endpoint, session, batch_size=10):
363 | # Do this in batches by first explicitly stopping all of the servers since
364 | # that could be asynchronous, then wait for the servers to be stopped and
365 | # then finally delete the users.
366 | stopped = stop_servers(usernames, endpoint, session, batch_size)
367 |
368 | # Now wait for the servers to be stopped. With a big list the ones at the
369 | # end should be done by the time we get to them.
370 | wait_for_servers_to_stop(stopped, endpoint, session)
371 |
372 | # Now try to delete the users.
373 | return delete_users_after_stopping_servers(stopped, endpoint, session)
374 |
375 |
376 | @timeit
377 | def create_users(count, batch_size, endpoint, session, existing_users=[]):
378 | LOG.info('Start creating %d users in batches of %d at %s',
379 | count, batch_size, endpoint)
380 | # POST /users is a synchronous call so the timeout should be the batch size
381 | # or greater.
382 | timeout = max(batch_size, DEFAULT_TIMEOUT)
383 | num_existing_users = len(existing_users)
384 | index = num_existing_users + 1
385 | users = [] # Keep track of the batches to create servers.
386 | while index <= count + num_existing_users:
387 | # Batch create multiple users in a single request.
388 | usernames = []
389 | for _ in range(batch_size):
390 | usernames.append('%s-%d' % (USERNAME_PREFIX, index))
391 | index += 1
392 | # Maybe we should use the single user POST so we can deal with 409s
393 | # gracefully if we are re-running the script on a set of existing users
394 | resp = session.post(endpoint + '/users', json={'usernames': usernames},
395 | timeout=timeout)
396 | if resp:
397 | LOG.debug('Created users: %s', usernames)
398 | users.append(usernames)
399 | else:
400 | LOG.error('Failed to create users: %s. Response status code: %d. '
401 | 'Response content: %s', usernames, resp.status_code,
402 | resp.content)
403 | try:
404 | delete_users(usernames, endpoint, session)
405 | except Exception:
406 | LOG.warning('Failed to delete users: %s', usernames,
407 | exc_info=True)
408 | raise Exception('Failed to create users.')
409 | return users
410 |
411 |
412 | def start_server(username, endpoint, session, profile=None):
413 | if profile:
414 | profile = {"profile": profile}
415 | resp = session.post(endpoint + '/users/%s/server' % username,
416 | timeout=DEFAULT_TIMEOUT, json=profile)
417 | if resp:
418 | LOG.debug('Server for user %s is starting', username)
419 | else:
420 | # Should we delete the user now? Should we stop or keep going?
421 | LOG.error('Failed to create server for user: %s. '
422 | 'Response status code: %d. Response content: %s',
423 | username, resp.status_code, resp.content)
424 |
425 |
426 | @timeit
427 | def start_servers(users, endpoint, session, profile=None):
428 | LOG.info('Starting notebook servers')
429 | for index, usernames in enumerate(users):
430 | # Start the servers in batches using a ThreadPoolExecutor because
431 | # the start operation is not totally asynchronous so we should be able
432 | # to speed this up by doing the starts concurrently. That will also be
433 | # more realistic to users logging on en masse during an event.
434 | thread_name_prefix = f'hub-stress-test:start_servers:{index}'
435 | with futures.ThreadPoolExecutor(
436 | max_workers=len(usernames),
437 | thread_name_prefix=thread_name_prefix) as executor:
438 | for username in usernames:
439 | executor.submit(
440 | start_server, username, endpoint, session,
441 | profile=profile
442 | )
443 |
444 |
445 | @timeit
446 | def wait_for_servers_to_start(users, endpoint, session):
447 | LOG.info('Waiting for notebook servers to be ready')
448 | # Rather than do a GET for each individual user/server, we could get all
449 | # users and then filter out any that aren't in our list. However, there
450 | # could be servers in that list that are ready (the ones created first) and
451 | # others that are not yet (the ones created last). If we check individually
452 | # then there is a chance that by the time we get to the end of the list
453 | # those servers are already ready while we waited for those at the front of
454 | # the list.
455 | for usernames in users:
456 | for username in usernames:
457 | count = 0 # start our timer
458 | while count < SERVER_LIFECYCLE_TIMEOUT:
459 | resp = session.get(endpoint + '/users/%s' % username)
460 | if resp:
461 | user = resp.json()
462 | # We don't allow named servers so the user should have a
463 | # single server named ''.
464 | server = user.get('servers', {}).get('', {})
465 | if server.get('ready'):
466 | LOG.debug('Server for user %s is ready after %d '
467 | 'checks', username, count + 1)
468 | break
469 | elif not server.get('pending'):
470 | # It's possible that the server failed to start and in
471 | # that case we want to break the loop so we don't wait
472 | # needlessly until the timeout.
473 | LOG.error('Server for user %s failed to start. Waited '
474 | '%d seconds but the user record has no '
475 | 'pending action. Check the hub logs for '
476 | 'details. User: %s', username, count, user)
477 | break
478 | else:
479 | LOG.warning('Failed to get user: %s. Response status '
480 | 'code: %d. Response content: %s', username,
481 | resp.status_code, resp.content)
482 | time.sleep(1)
483 | count += 1
484 | else:
485 | # Should we fail here?
486 | LOG.error('Timed out waiting for server for user %s to be '
487 | 'ready after %d seconds', username,
488 | SERVER_LIFECYCLE_TIMEOUT)
489 |
490 |
491 | @timeit
492 | def find_existing_stress_test_users(endpoint, session):
493 | """Finds all existing hub-stress-test users.
494 |
495 | :param endpoint: base endpoint URL
496 | :param session: requests.Session instance
497 | :returns: list of existing hub-stress-test users
498 | """
499 | # This could be a lot of users so make the timeout conservative.
500 | resp = session.get(endpoint + '/users', timeout=120)
501 | if resp:
502 | users = resp.json()
503 | LOG.debug('Found %d existing users in the hub', len(users))
504 | if users:
505 | users = list(
506 | filter(lambda user: user['name'].startswith(USERNAME_PREFIX),
507 | users))
508 | LOG.debug('Found %d existing hub-stress-test users', len(users))
509 | return users
510 | else:
511 | # If the token is bad then we want to bail.
512 | if resp.status_code == 403:
513 | raise Exception('Invalid token')
514 | LOG.warning('Failed to list existing users: %s', resp.content)
515 | return []
516 |
517 |
518 | @timeit
519 | def run_stress_test(count, batch_size, token, endpoint, dry_run=False,
520 | keep=False, profile=None):
521 | session = get_session(token, dry_run=dry_run)
522 | if batch_size > count:
523 | batch_size = count
524 | # First figure out how many existing hub-stress-test users there are since
525 | # that will determine our starting index for names.
526 | existing_users = find_existing_stress_test_users(endpoint, session)
527 | # Create the users in batches.
528 | users = create_users(count, batch_size, endpoint, session,
529 | existing_users=existing_users)
530 | # Now that we've created the users, start a server for each in batches.
531 | start_servers(users, endpoint, session, profile=profile)
532 | # Now that all servers are starting we need to poll until they are ready.
533 | # Note that because of the concurrent_spawn_limit in the hub we could be
534 | # waiting awhile. We could also be waiting in case the auto-scaler needs to
535 | # add more nodes.
536 | wait_for_servers_to_start(users, endpoint, session)
537 | # If we don't need to keep the users/servers then remove them.
538 | if not keep:
539 | # Flatten the list of lists so we delete all users in a single run.
540 | usernames = [username for usernames in users for username in usernames]
541 | LOG.info('Deleting %d users', len(usernames))
542 | if not delete_users(usernames, endpoint, session, batch_size):
543 | raise Exception('Failed to delete all users')
544 |
545 |
546 | @timeit
547 | def purge_users(token, endpoint, dry_run=False):
548 | session = get_session(token, dry_run=dry_run)
549 | users = find_existing_stress_test_users(endpoint, session)
550 | if users:
551 | usernames = [user['name'] for user in users]
552 | LOG.info('Deleting %d users', len(usernames))
553 | if not delete_users(usernames, endpoint, session):
554 | raise Exception('Failed to delete all users')
555 |
556 |
557 | @timeit
558 | def notebook_activity_test(count, token, endpoint, workers, keep=False,
559 | dry_run=False):
560 | if count < workers:
561 | workers = count
562 | session = get_session(token=token, dry_run=dry_run, pool_maxsize=workers)
563 |
564 | # First figure out how many existing hub-stress-test users there are since
565 | # that will determine our starting index for names.
566 | existing_users = find_existing_stress_test_users(endpoint, session)
567 |
568 | usernames = [user['name'] for user in existing_users]
569 |
570 | # Create the missing users.
571 | to_create = count - len(existing_users)
572 | if to_create > 0:
573 | users = create_users(to_create, to_create, endpoint, session,
574 | existing_users=existing_users)
575 | usernames.extend([name for usernames in users for name in usernames])
576 |
577 | def send_activity(users, endpoint, session):
578 | now = datetime.utcnow() + timedelta(minutes=1)
579 | now = now.isoformat()
580 | body = {
581 | "servers": {
582 | "": {
583 | "last_activity": now,
584 | }
585 | },
586 | "last_activity": now,
587 | }
588 | times = []
589 | for username in users:
590 | time.sleep(random.random())
591 | url = "{}/users/{}/activity".format(endpoint, username)
592 | resp = session.post(
593 | url, data=json.dumps(body), timeout=DEFAULT_TIMEOUT)
594 | total_time = 1 if dry_run else resp.elapsed.total_seconds()
595 | times.append(total_time)
596 | LOG.debug("Sent activity for user %s (%f)", username, total_time)
597 |
598 | return times
599 |
600 | def chunk(users, n):
601 | for i in range(0, len(users), n):
602 | yield users[i:i + n]
603 |
604 | # STOP_PING is used to control the ping_hub function.
605 | STOP_PING = False
606 |
607 | def ping_hub(endpoint, session):
608 | ping_times = []
609 | while not STOP_PING:
610 | resp = session.get("{}/users/{}".format(endpoint, usernames[0]))
611 | total = 1 if dry_run else resp.elapsed.total_seconds()
612 | ping_times.append(total)
613 | LOG.debug("[ping-hub] Fetching user model took %f seconds", total)
614 |
615 | avg = sum(ping_times) / len(ping_times)
616 | LOG.info("Hub ping time: average=%f, min=%f, max=%f",
617 | avg, min(ping_times), max(ping_times))
618 |
619 | LOG.info("Simulating activity updates for %d users", count)
620 | times = []
621 | with futures.ThreadPoolExecutor(max_workers=workers) as executor:
622 | # Launch our 'ping' thread. This will repeatedly hit the API during
623 | # the test and track the timing. We don't need to get the future
624 | # because this thread is controlled via the STOP_PING variable.
625 | executor.submit(ping_hub, endpoint, session)
626 |
627 | # Give each worker thread an even share of the test users. Each thread
628 | # will iterate over its list of users and POST an activity update. The
629 | # thread will sleep a random amount of time between 0 and 1 seconds
630 | # between users.
631 | future_to_timing = {
632 | executor.submit(send_activity, users, endpoint, session): users
633 | for users in chunk(usernames, max(1, len(usernames) // workers))
634 | }
635 | for future in futures.as_completed(future_to_timing):
636 | times.extend(future.result())
637 |
638 | # We only want the ping_hub thread to run while the users are POSTing
639 | # activity updates. Once all futures are completed we can shut down
640 | # the ping thread.
641 | STOP_PING = True
642 |
643 | avg = sum(times) / len(times)
644 | LOG.info("Time to POST activity update: average=%f, min=%f, max=%f",
645 | avg, min(times), max(times))
646 |
647 | if not keep:
648 | delete_users(usernames, endpoint, session)
649 |
650 |
651 | def main():
652 | args = parse_args()
653 | setup_logging(verbose=args.verbose, log_to_file=args.log_to_file,
654 | args=args)
655 | try:
656 | validate(args)
657 | except Exception as e:
658 | LOG.error(e)
659 | sys.exit(1)
660 |
661 | try:
662 | if args.command == 'purge':
663 | purge_users(args.token, args.endpoint, dry_run=args.dry_run)
664 | elif args.command == 'stress-test':
665 | run_stress_test(args.count, args.batch_size, args.token,
666 | args.endpoint, dry_run=args.dry_run,
667 | keep=args.keep, profile=args.profile)
668 | elif args.command == 'activity-stress-test':
669 | notebook_activity_test(args.count, args.token,
670 | args.endpoint, args.workers, keep=args.keep,
671 | dry_run=args.dry_run)
672 | except Exception as e:
673 | LOG.exception(e)
674 | sys.exit(128)
675 |
676 |
677 | if __name__ == "__main__":
678 | main()
679 |
--------------------------------------------------------------------------------
/test-requirements.txt:
--------------------------------------------------------------------------------
1 | flake8
2 |
--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | # tox (https://tox.readthedocs.io/) is a tool for running tests
2 | # in multiple virtualenvs. This configuration runs the flake8 lint
3 | # checks and a dry-run of the hub-stress-test script. To use it,
4 | # "pip install tox" and then run "tox" from this directory.
5 |
6 | [tox]
7 | envlist = flake8, hub-stress-test
8 | skipsdist = True
9 |
10 | [testenv]
11 | basepython = python3
12 | whitelist_externals =
13 | cat
14 | rm
15 | deps =
16 | -r{toxinidir}/requirements.txt
17 | -r{toxinidir}/test-requirements.txt
18 |
19 | [testenv:flake8]
20 | commands =
21 | flake8 scripts
22 |
23 | [testenv:hub-stress-test]
24 | basepython = python3.7
25 | setenv =
26 | JUPYTERHUB_API_TOKEN=test
27 | JUPYTERHUB_ENDPOINT=https://notebooks.foo.com/hub/api
28 | commands =
29 | python scripts/hub-stress-test.py -v --dry-run stress-test -c 5
30 | python scripts/hub-stress-test.py -v --dry-run --log-to-file purge.log purge
31 | python scripts/hub-stress-test.py --dry-run activity-stress-test --count 10
32 | cat purge.log
33 | rm purge.log
34 |
--------------------------------------------------------------------------------