├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── pip_package
│   ├── CHANGELOG.md
│   ├── cloud_accelerator_diagnostics
│   │   ├── __init__.py
│   │   ├── src
│   │   │   └── tensorboard_uploader
│   │   │       ├── tensorboard.py
│   │   │       └── uploader.py
│   │   └── tests
│   │       └── tensorboard_uploader
│   │           ├── tensorboard_test.py
│   │           └── uploader_test.py
│   └── pyproject.toml
└── tpu_info
    ├── README.md
    ├── pyproject.toml
    └── tpu_info
        ├── __init__.py
        ├── args.py
        ├── cli.py
        ├── device.py
        ├── metrics.py
        └── proto
            ├── __init__.py
            └── tpu_metric_service.proto
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 |
16 | # How to contribute
17 |
18 | We'd love to accept your patches and contributions to this project.
19 |
20 | ## Before you begin
21 |
22 | ### Sign our Contributor License Agreement
23 |
24 | Contributions to this project must be accompanied by a
25 | [Contributor License Agreement](https://cla.developers.google.com/about) (CLA).
26 | You (or your employer) retain the copyright to your contribution; this simply
27 | gives us permission to use and redistribute your contributions as part of the
28 | project.
29 |
30 | If you or your current employer have already signed the Google CLA (even if it
31 | was for a different project), you probably don't need to do it again.
32 |
33 | Visit <https://cla.developers.google.com/> to see your current agreements or to
34 | sign a new one.
35 |
36 | ### Review our community guidelines
37 |
38 | This project follows
39 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/).
40 |
41 | ## Contribution process
42 |
43 | ### Code reviews
44 |
45 | All submissions, including submissions by project members, require review. We
46 | use GitHub pull requests for this purpose. Consult
47 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
48 | information on using pull requests.
49 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
16 | # Cloud Accelerator Diagnostics
17 |
18 | ## Overview
19 | Cloud Accelerator Diagnostics is a library for monitoring, debugging, and profiling workloads running on Cloud accelerators such as TPUs and GPUs. It also provides a streamlined way to automatically upload data to Tensorboard Experiments in Vertex AI: the package lets users create a Tensorboard instance and Experiments in Vertex AI, and upload logs to them.
20 |
21 | ## Installation
22 | To install the Cloud Accelerator Diagnostics package, run the following command:
23 |
24 | ```bash
25 | pip install cloud-accelerator-diagnostics
26 | ```
27 |
28 | ## Automating Uploads to Vertex AI Tensorboard
29 | Before creating and uploading logs to Vertex AI Tensorboard, you must enable the [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console. Also, make sure to assign the [Vertex AI User IAM role](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.user) to the service account that will call the APIs in the `cloud-accelerator-diagnostics` package. This is required to create and access the Vertex AI Tensorboard in the Google Cloud console.
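
A minimal sketch of these prerequisites using standard `gcloud` commands is shown below; the service account name is a hypothetical example:

```bash
# Enable the Vertex AI API for the project (assumes an authenticated gcloud CLI).
gcloud services enable aiplatform.googleapis.com --project=test-project

# Grant the Vertex AI User role to the service account that will call the
# cloud-accelerator-diagnostics APIs (the service account name is hypothetical).
gcloud projects add-iam-policy-binding test-project \
  --member="serviceAccount:diagnostics-sa@test-project.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"
```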
30 |
31 | ### Create Vertex AI Tensorboard
32 | To learn about Vertex AI Tensorboard, visit this [page](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction).
33 |
34 | Here is an example script to create a Vertex AI Tensorboard instance with the name `test-instance` in Google Cloud Project `test-project`.
35 |
36 | Note: Vertex AI is available only in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions.
37 |
38 | ```python
39 | from cloud_accelerator_diagnostics import tensorboard
40 |
41 | instance_id = tensorboard.create_instance(project="test-project",
42 | location="us-central1",
43 | tensorboard_name="test-instance")
44 | print("Vertex AI Tensorboard created: ", instance_id)
45 | ```
46 |
47 | ### Create Vertex AI Experiment
48 | To learn about Vertex AI Experiments, visit this [page](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
49 |
50 | The following script will create a Vertex AI Experiment named `test-experiment` in your Google Cloud Project `test-project`. Here's how it handles attaching a Tensorboard instance:
51 |
52 | **Scenario 1: Tensorboard Instance Exists**
53 |
54 | If a Tensorboard instance named `test-instance` already exists in your project, the script will attach it to the new Experiment.
55 |
56 | **Scenario 2: No Tensorboard Instance Present**
57 |
58 | If `test-instance` does not exist, the script will create a new Tensorboard instance with that name and attach it to the Experiment.
59 |
60 | ```python
61 | from cloud_accelerator_diagnostics import tensorboard
62 |
63 | instance_id, tensorboard_url = tensorboard.create_experiment(project="test-project",
64 | location="us-central1",
65 | experiment_name="test-experiment",
66 | tensorboard_name="test-instance")
67 |
68 | print("View your Vertex AI Tensorboard here: ", tensorboard_url)
69 | ```
70 |
71 | If a Vertex AI Experiment with the specified name exists, a new one will not be created, and the existing Experiment's URL will be returned.
72 |
73 | Note: You can attach multiple Vertex AI Experiments to a single Vertex AI Tensorboard.
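
For instance, the sketch below attaches two Experiments to the same `test-instance` (the experiment names `experiment-one` and `experiment-two` are illustrative):

```python
from cloud_accelerator_diagnostics import tensorboard

# Both Experiments are attached to the Tensorboard instance "test-instance".
_, url_one = tensorboard.create_experiment(project="test-project",
                                           location="us-central1",
                                           experiment_name="experiment-one",
                                           tensorboard_name="test-instance")
_, url_two = tensorboard.create_experiment(project="test-project",
                                           location="us-central1",
                                           experiment_name="experiment-two",
                                           tensorboard_name="test-instance")
```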
74 |
75 | ### Upload Logs to Vertex AI Tensorboard
76 | The following script continuously monitors the log directory (`logdir`) for new data and uploads it to your Vertex AI Tensorboard Experiment. Note that after calling `start_upload_to_tensorboard()`, the thread will be kept alive even if an exception is thrown. To ensure the thread gets shut down, put any code after `start_upload_to_tensorboard()` and before `stop_upload_to_tensorboard()` in a `try` block, and call `stop_upload_to_tensorboard()` in a `finally` block. This example shows how you can upload the [profile logs](https://jax.readthedocs.io/en/latest/profiling.html#programmatic-capture) collected for your JAX workload to Vertex AI Tensorboard.
77 |
78 | ```python
79 | from cloud_accelerator_diagnostics import uploader
80 |
81 | uploader.start_upload_to_tensorboard(project="test-project",
82 | location="us-central1",
83 | experiment_name="test-experiment",
84 | tensorboard_name="test-instance",
85 | logdir="gs://test-directory/testing")
86 | try:
87 | jax.profiler.start_trace("gs://test-directory/testing")
88 |   # ... run the JAX workload you want to profile ...
89 | jax.profiler.stop_trace()
90 | finally:
91 | uploader.stop_upload_to_tensorboard()
92 | ```
93 |
--------------------------------------------------------------------------------
/pip_package/CHANGELOG.md:
--------------------------------------------------------------------------------
1 |
16 | # Changelog
17 |
18 |
33 |
34 | # [0.1.1] - 2024-10-14
35 | * Version 0.1.1 of `cloud-accelerator-diagnostics` PyPI package
36 | * Features:
37 | * Use Vertex AI's continuous uploader directly
38 |
39 | # [0.1.0] - 2024-03-20
40 | * Initial release of `cloud-accelerator-diagnostics` PyPI package
41 | * Features:
42 | * Create a Vertex AI Tensorboard instance in Google Cloud Project
43 | * Create a Vertex AI Experiment in Google Cloud Project
44 | * Automatically upload logs to Vertex AI Tensorboard Experiment
45 |
--------------------------------------------------------------------------------
/pip_package/cloud_accelerator_diagnostics/__init__.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | from cloud_accelerator_diagnostics.src.tensorboard_uploader import tensorboard
16 | from cloud_accelerator_diagnostics.src.tensorboard_uploader import uploader
17 |
--------------------------------------------------------------------------------
/pip_package/cloud_accelerator_diagnostics/src/tensorboard_uploader/tensorboard.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Tensorboard module.
16 |
17 | This module provides the functionality to create Tensorboard instance and
18 | Experiment in Vertex AI.
19 | """
20 |
21 | import logging
22 |
23 | from google.cloud.aiplatform import aiplatform
24 |
25 |
26 | logger = logging.getLogger(__name__)
27 |
28 | # The API URI for accessing the Tensorboard UI
29 | WEB_SERVER_URI = "tensorboard.googleusercontent.com"
30 |
31 |
32 | def create_instance(project, location, tensorboard_name):
33 | """Creates a new Tensorboard instance in Vertex AI.
34 |
35 | Args:
36 |     project (str): Google Cloud Project in which to create the Tensorboard
37 |       instance.
38 |     location (str): Location in which to create the Tensorboard instance. See
38 | https://cloud.google.com/vertex-ai/docs/general/locations#available-regions
39 | for the list of available VertexAI locations.
40 | tensorboard_name (str): The user-defined name of the Tensorboard. The name
41 |       can be up to 128 characters long and can consist of any UTF-8
42 | characters.
43 |
44 | Returns:
45 | str: The Tensorboard instance identifier.
46 | """
47 | try:
48 | aiplatform.init(project=project, location=location)
49 | tensorboard_identifiers = get_instance_identifiers(tensorboard_name)
50 | if not tensorboard_identifiers:
51 | # create a new Tensorboard instance if an instance doesn't exist
52 | logger.info(
53 | "Creating a Tensorboard instance with the name: %s", tensorboard_name
54 | )
55 | tensorboard = aiplatform.Tensorboard.create(
56 | display_name=tensorboard_name,
57 | project=project,
58 | location=location,
59 | )
60 | return tensorboard.name
61 | else:
62 | logger.info(
63 |         "Tensorboard instance with the name: %s already exists in project: %s"
64 | " and location: %s. Not creating a new Tensorboard instance.",
65 | tensorboard_name,
66 | project,
67 | location,
68 | )
69 | # return the first Tensorboard instance even if multiple instances exist
70 | return tensorboard_identifiers[0]
71 |   except Exception:
72 | logger.exception("Error while creating Tensorboard instance.")
73 | return None
74 |
75 |
76 | def create_experiment(project, location, experiment_name, tensorboard_name):
77 | """Creates a new Tensorboard Experiment in VertexAI.
78 |
79 | Args:
80 |     project (str): Google Cloud Project in which to create the Tensorboard
81 |       experiment.
82 |     location (str): Location in which to create the Tensorboard experiment. See
82 | https://cloud.google.com/vertex-ai/docs/general/locations#available-regions
83 | for the list of available VertexAI locations.
84 | experiment_name (str): The name of the Tensorboard experiment to create.
85 | This value should be 1-128 characters, and valid characters are
86 | /[a-z][0-9]-/.
87 | tensorboard_name (str): The name of the Tensorboard to create the
88 | Tensorboard Experiment in.
89 |
90 | Returns:
91 | str: The Tensorboard instance identifier.
92 | str: The URL to access the Tensorboard UI.
93 | """
94 | try:
95 | aiplatform.init(project=project, location=location)
96 |
97 | # Get the identifier for the Tensorboard instance. If no Tensorboard
98 | # instance is present, then create a new instance.
99 | tensorboard_identifiers = get_instance_identifiers(tensorboard_name)
100 | if not tensorboard_identifiers:
101 | logger.info(
102 | "No Tensorboard instance present in the project: %s. Creating"
103 | " a new Tensorboard instance with the name: %s",
104 | project,
105 | tensorboard_name,
106 | )
107 | tensorboard_id = create_instance(project, location, tensorboard_name)
108 | # create_instance() failed to create a Tensorboard instance
109 | if tensorboard_id is None:
110 | return None, None
111 | else:
112 | # get the first Tensorboard instance even if multiple instances exist
113 | tensorboard_id = tensorboard_identifiers[0]
114 |
115 |     # check if an experiment already exists for the tensorboard_id
116 | experiment = get_experiment(tensorboard_id, experiment_name)
117 | if experiment is not None:
118 | logger.info(
119 |           "Experiment with the name: %s already exists in the project: %s."
120 | " Not creating a new Experiment.",
121 | experiment_name,
122 | project,
123 | )
124 | else:
125 | logger.info(
126 | "Creating Experiment for Tensorboard instance id: %s", tensorboard_id
127 | )
128 | experiment = aiplatform.TensorboardExperiment.create(
129 | tensorboard_experiment_id=experiment_name,
130 | display_name=experiment_name,
131 | tensorboard_name=tensorboard_id,
132 | )
133 | experiment_resource_name = experiment.resource_name
134 | tensorboard_url = "https://{}.{}/experiment/{}".format(
135 | location,
136 | WEB_SERVER_URI,
137 | experiment_resource_name.replace("/", "+"),
138 | )
139 | return tensorboard_id, tensorboard_url
140 |   except Exception:
141 | logger.exception("Error while creating Tensorboard Experiment.")
142 | return None, None
143 |
144 |
145 | def get_instance_identifiers(tensorboard_name):
146 | """Retrieves a list of Tensorboard instance identifiers that match the given `tensorboard_name`.
147 |
148 | Args:
149 | tensorboard_name (str): The name of the Tensorboard instance to search for.
150 |
151 | Returns:
152 | list: A list of Tensorboard instance identifiers that match
153 | `tensorboard_name`.
154 | """
155 | tensorboard_instances = aiplatform.tensorboard.Tensorboard.list()
156 | tensorboard_identifiers = []
157 | for tensorboard in tensorboard_instances:
158 | if tensorboard.display_name == tensorboard_name:
159 | tensorboard_identifiers.append(tensorboard.name)
160 | return tensorboard_identifiers
161 |
162 |
163 | def get_experiment(tensorboard_id, experiment_name):
164 | """Retrieves the experiment object if an experiment with the given `experiment_name` exists for the given `tensorboard_id`.
165 |
166 | Args:
167 |     tensorboard_id (str): The id of the Tensorboard instance.
168 |     experiment_name (str): The name of the Tensorboard experiment.
169 |
170 | Returns:
171 |     TensorboardExperiment object if an experiment with the given name exists
172 | in the project, None otherwise.
173 | """
174 | experiment_list = aiplatform.tensorboard.TensorboardExperiment.list(
175 | tensorboard_id
176 | )
177 | for experiment in experiment_list:
178 | if experiment.display_name == experiment_name:
179 | return experiment
180 | return None
181 |
--------------------------------------------------------------------------------
/pip_package/cloud_accelerator_diagnostics/src/tensorboard_uploader/uploader.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Uploader module.
16 |
17 | This module provides the functionality to upload data to Tensorboard in Vertex
18 | AI.
19 | """
20 |
21 | import logging
22 |
23 | from cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader import tensorboard
24 | from google.cloud.aiplatform import aiplatform
25 |
26 |
27 | logger = logging.getLogger(__name__)
28 |
29 |
30 | def start_upload_to_tensorboard(
31 | project,
32 | location,
33 | experiment_name,
34 | tensorboard_name,
35 | logdir,
36 | ):
37 |   """Continuously listens for new data in the logdir and uploads it when it appears.
38 |
39 |   Note that after calling `start_upload_to_tensorboard()`, the thread will be
40 |   kept alive even if an exception is thrown. To ensure the thread gets shut
41 |   down, put any code after `start_upload_to_tensorboard()` and before
42 |   `stop_upload_to_tensorboard()` in a `try` block, and call
43 |   `stop_upload_to_tensorboard()` in a `finally` block.
44 |
45 | Sample usage:
46 | ```
47 |   start_upload_to_tensorboard(project='test-project',
48 | location='us-central1',
49 | experiment_name='test-experiment',
50 | tensorboard_name='test-instance',
51 | logdir='test-logdir')
52 | try:
53 | # your code here
54 | finally:
55 | stop_upload_to_tensorboard()
56 | ```
57 |
58 | Args:
59 | project (str): Google Cloud Project that has the Tensorboard instance.
60 | location (str): Location where Tensorboard instance is present.
61 | experiment_name (str): The name of the Tensorboard experiment.
62 | tensorboard_name (str): The name of the Tensorboard instance.
63 |     logdir (str): Path of the log directory to upload to Tensorboard.
64 | """
65 | try:
66 | aiplatform.init(project=project, location=location)
67 |
68 | # Skip uploading logs to VertexAI if a Tensorboard instance doesn't exist
69 | tensorboard_identifiers = tensorboard.get_instance_identifiers(
70 | tensorboard_name
71 | )
72 | if not tensorboard_identifiers:
73 | logger.error(
74 | "No Tensorboard instance with the name %s present in the project %s."
75 | " Skipping uploading logs to VertexAI.",
76 | tensorboard_name,
77 | project,
78 | )
79 | return
80 | else:
81 | # get the first Tensorboard instance even if multiple instances exist
82 | tensorboard_id = tensorboard_identifiers[0]
83 |
84 | # Skip uploading logs to VertexAI if a Tensorboard experiment doesn't exist
85 | experiment = tensorboard.get_experiment(tensorboard_id, experiment_name)
86 | if experiment is None:
87 | logger.error(
88 | "No Tensorboard experiment with the name %s present in the project"
89 | " %s. Skipping uploading logs to VertexAI.",
90 | experiment_name,
91 | project,
92 | )
93 | return
94 |
95 | start_upload(tensorboard_id, experiment_name, logdir)
96 |   except Exception as e:
97 | logger.exception(
98 | "Error while uploading logs to Tensorboard. This will not impact the"
99 | " workload. Error: %s",
100 | e,
101 | )
102 |
103 |
104 | def stop_upload_to_tensorboard():
105 | """Stops the thread created by `start_upload_to_tensorboard()`."""
106 | logger.info("Logs will no longer be uploaded to Tensorboard.")
107 | aiplatform.end_upload_tb_log()
108 |
109 |
110 | def start_upload(tensorboard_id, experiment_name, logdir):
111 | """Starts uploading logs to Tensorboard instance in VertexAI.
112 |
113 | Args:
114 | tensorboard_id (str): The id of Tensorboard instance.
115 | experiment_name (str): The name of the Tensorboard experiment.
116 |     logdir (str): Path of the log directory to upload to Tensorboard.
117 | """
118 |   logger.info("Starting upload of logs to Tensorboard.")
119 | try:
120 | aiplatform.start_upload_tb_log(
121 | tensorboard_id=tensorboard_id,
122 | tensorboard_experiment_name=experiment_name,
123 | logdir=logdir,
124 | )
125 | except Exception as e:
126 | logger.exception(
127 | "Error while uploading logs to Tensorboard. This will not impact the"
128 | " workload. Error: %s",
129 | e,
130 | )
131 |
--------------------------------------------------------------------------------
/pip_package/cloud_accelerator_diagnostics/tests/tensorboard_uploader/tensorboard_test.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | from absl.testing import absltest
16 | from cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader import tensorboard
17 |
18 |
19 | class TensorboardTest(absltest.TestCase):
20 |
21 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard.create")
22 | @absltest.mock.patch(
23 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
24 | )
25 | def testCreateInstanceWhenNoInstanceExist(
26 | self, mock_tensorboard_list, mock_tensorboard_create
27 | ):
28 | mock_tensorboard_list.return_value = []
29 | mock_tensorboard_create.return_value.name = "123"
30 |
31 | instance_id = tensorboard.create_instance(
32 | "test-project", "us-central1", "test-instance"
33 | )
34 |
35 | mock_tensorboard_list.assert_called_once()
36 | mock_tensorboard_create.assert_called_once_with(
37 | project="test-project",
38 | location="us-central1",
39 | display_name="test-instance",
40 | )
41 | self.assertEqual(instance_id, "123")
42 |
43 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard")
44 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard.create")
45 | @absltest.mock.patch(
46 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
47 | )
48 | def testCreateInstanceWhenSameNameInstanceExist(
49 | self, mock_tensorboard_list, mock_tensorboard_create, mock_tensorboard
50 | ):
51 | mock_tensorboard_instance = mock_tensorboard.return_value
52 | mock_tensorboard_instance.display_name = "test-instance"
53 | mock_tensorboard_list.return_value = [mock_tensorboard_instance]
54 |
55 | instance_id = tensorboard.create_instance(
56 | "test-project", "us-central1", "test-instance"
57 | )
58 |
59 | mock_tensorboard_list.assert_called_once()
60 | mock_tensorboard_create.assert_not_called()
61 | self.assertEqual(instance_id, mock_tensorboard_instance.name)
62 |
63 | def testCreateInstanceForUnsupportedRegion(self):
64 | with self.assertLogs(level="ERROR") as log:
65 | instance_id = tensorboard.create_instance(
66 | "test-project", "us-central2", "test-instance"
67 | )
68 |
69 | self.assertRegex(
70 | log.output[0], "ValueError: Unsupported region for Vertex AI"
71 | )
72 | self.assertIsNone(instance_id)
73 |
74 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard.create")
75 | @absltest.mock.patch(
76 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
77 | )
78 | def testCreateInstanceWhenExceptionIsThrown(
79 | self, mock_tensorboard_list, mock_tensorboard_create
80 | ):
81 | mock_tensorboard_list.return_value = []
82 |     mock_tensorboard_create.side_effect = Exception("Exception is thrown...")
83 |
84 | with self.assertLogs(level="ERROR"):
85 | instance_id = tensorboard.create_instance(
86 | "test-project", "us-central1", "test-instance"
87 | )
88 |
89 | mock_tensorboard_list.assert_called_once()
90 | mock_tensorboard_create.assert_called_once_with(
91 | project="test-project",
92 | location="us-central1",
93 | display_name="test-instance",
94 | )
95 | self.assertIsNone(instance_id)
96 |
97 | @absltest.mock.patch(
98 | "google.cloud.aiplatform.aiplatform.tensorboard.TensorboardExperiment.list"
99 | )
100 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard")
101 | @absltest.mock.patch(
102 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
103 | )
104 | @absltest.mock.patch(
105 | "google.cloud.aiplatform.aiplatform.TensorboardExperiment.create"
106 | )
107 | def testCreateExperimentWhenTensorboardInstanceExist(
108 | self,
109 | mock_experiment_create,
110 | mock_tensorboard_list,
111 | mock_tensorboard,
112 | mock_experiment_list,
113 | ):
114 | mock_tensorboard_instance = mock_tensorboard.return_value
115 | mock_tensorboard_instance.display_name = "test-instance"
116 | mock_tensorboard_instance.name = "123"
117 | mock_tensorboard_list.return_value = [mock_tensorboard_instance]
118 | mock_experiment_list.return_value = []
119 | expected_resource_name = "projects/770040921623/locations/us-central1/tensorboards/123/experiments/test-experiment"
120 | mock_experiment_create.return_value.resource_name = expected_resource_name
121 | expected_tensorboard_url = (
122 | "https://us-central1.tensorboard.googleusercontent.com/experiment/"
123 | + expected_resource_name.replace("/", "+")
124 | )
125 |
126 | instance_id, tensorboard_url = tensorboard.create_experiment(
127 | "test-project", "us-central1", "test-experiment", "test-instance"
128 | )
129 |
130 | mock_tensorboard_list.assert_called_once()
131 | mock_experiment_create.assert_called_once_with(
132 | tensorboard_experiment_id="test-experiment",
133 | tensorboard_name="123",
134 | display_name="test-experiment",
135 | )
136 | self.assertEqual(instance_id, "123")
137 | self.assertEqual(tensorboard_url, expected_tensorboard_url)
138 |
139 | @absltest.mock.patch(
140 | "google.cloud.aiplatform.aiplatform.tensorboard.TensorboardExperiment.list"
141 | )
142 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard.create")
143 | @absltest.mock.patch(
144 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
145 | )
146 | @absltest.mock.patch(
147 | "google.cloud.aiplatform.aiplatform.TensorboardExperiment.create"
148 | )
149 | def testCreateExperimentWhenNoTensorboardInstanceExist(
150 | self,
151 | mock_experiment_create,
152 | mock_tensorboard_list,
153 | mock_tensorboard_create,
154 | mock_experiment_list,
155 | ):
156 | mock_tensorboard_list.return_value = []
157 | mock_tensorboard_create.return_value.name = "123"
158 | mock_experiment_list.return_value = []
159 | expected_resource_name = "projects/770040921623/locations/us-central1/tensorboards/123/experiments/test-experiment"
160 | mock_experiment_create.return_value.resource_name = expected_resource_name
161 | expected_tensorboard_url = (
162 | "https://us-central1.tensorboard.googleusercontent.com/experiment/"
163 | + expected_resource_name.replace("/", "+")
164 | )
165 |
166 | instance_id, tensorboard_url = tensorboard.create_experiment(
167 | "test-project", "us-central1", "test-experiment", "test-instance"
168 | )
169 |
170 | mock_tensorboard_list.assert_called()
171 | mock_tensorboard_create.assert_called_once_with(
172 | project="test-project",
173 | location="us-central1",
174 | display_name="test-instance",
175 | )
176 | mock_experiment_create.assert_called_once_with(
177 | tensorboard_experiment_id="test-experiment",
178 | tensorboard_name="123",
179 | display_name="test-experiment",
180 | )
181 | self.assertEqual(instance_id, "123")
182 | self.assertEqual(tensorboard_url, expected_tensorboard_url)
183 |
184 | @absltest.mock.patch(
185 | "google.cloud.aiplatform.aiplatform.TensorboardExperiment"
186 | )
187 | @absltest.mock.patch(
188 | "google.cloud.aiplatform.aiplatform.tensorboard.TensorboardExperiment.list"
189 | )
190 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard")
191 | @absltest.mock.patch(
192 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
193 | )
194 | @absltest.mock.patch(
195 | "google.cloud.aiplatform.aiplatform.TensorboardExperiment.create"
196 | )
197 | def testCreateExperimentWhenTensorboardInstanceAndExperimentExist(
198 | self,
199 | mock_experiment_create,
200 | mock_tensorboard_list,
201 | mock_tensorboard,
202 | mock_experiment_list,
203 | mock_experiment,
204 | ):
205 | mock_tensorboard_instance = mock_tensorboard.return_value
206 | mock_tensorboard_instance.display_name = "test-instance"
207 | mock_tensorboard_instance.name = "123"
208 | mock_tensorboard_list.return_value = [mock_tensorboard_instance]
209 | expected_resource_name = "projects/770040921623/locations/us-central1/tensorboards/123/experiments/test-experiment"
210 | expected_tensorboard_url = (
211 | "https://us-central1.tensorboard.googleusercontent.com/experiment/"
212 | + expected_resource_name.replace("/", "+")
213 | )
214 | mock_experiment_instance = mock_experiment.return_value
215 | mock_experiment_instance.display_name = "test-experiment"
216 | mock_experiment_instance.resource_name = expected_resource_name
217 | mock_experiment_list.return_value = [mock_experiment_instance]
218 |
219 | instance_id, tensorboard_url = tensorboard.create_experiment(
220 | "test-project", "us-central1", "test-experiment", "test-instance"
221 | )
222 |
223 | mock_tensorboard_list.assert_called_once()
224 | mock_experiment_create.assert_not_called()
225 | self.assertEqual(instance_id, "123")
226 | self.assertEqual(tensorboard_url, expected_tensorboard_url)
227 |
228 | def testCreateExperimentForUnsupportedRegion(self):
229 | with self.assertLogs(level="ERROR") as log:
230 | instance_id, tensorboard_url = tensorboard.create_experiment(
231 | "test-project", "us-central2", "test-experiment", "test-instance"
232 | )
233 |
234 | self.assertRegex(
235 | log.output[0], "ValueError: Unsupported region for Vertex AI"
236 | )
237 | self.assertIsNone(instance_id)
238 | self.assertIsNone(tensorboard_url)
239 |
240 | @absltest.mock.patch("google.cloud.aiplatform.aiplatform.Tensorboard.create")
241 | @absltest.mock.patch(
242 | "google.cloud.aiplatform.aiplatform.tensorboard.Tensorboard.list"
243 | )
244 | @absltest.mock.patch(
245 | "google.cloud.aiplatform.aiplatform.TensorboardExperiment.create"
246 | )
247 | def testCreateExperimentWhenCreateInstanceFails(
248 | self,
249 | mock_experiment_create,
250 | mock_tensorboard_list,
251 | mock_tensorboard_create,
252 | ):
253 | mock_tensorboard_list.return_value = []
254 |     mock_tensorboard_create.side_effect = Exception("Exception is thrown...")
255 |
256 | with self.assertLogs(level="ERROR") as log:
257 | instance_id, tensorboard_url = tensorboard.create_experiment(
258 | "test-project", "us-central1", "test-experiment", "test-instance"
259 | )
260 |
261 | mock_tensorboard_list.assert_called()
262 | mock_tensorboard_create.assert_called_once_with(
263 | project="test-project",
264 | location="us-central1",
265 | display_name="test-instance",
266 | )
267 | mock_experiment_create.assert_not_called()
268 | self.assertRegex(
269 | log.output[0], "Error while creating Tensorboard instance."
270 | )
271 | self.assertIsNone(instance_id)
272 | self.assertIsNone(tensorboard_url)
273 |
274 |
275 | if __name__ == "__main__":
276 | absltest.main()
277 |
--------------------------------------------------------------------------------
/pip_package/cloud_accelerator_diagnostics/tests/tensorboard_uploader/uploader_test.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | import threading
16 |
17 | from absl.testing import absltest
18 | from cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader import uploader
19 |
20 |
21 | class UploaderTest(absltest.TestCase):
22 |
23 | @absltest.mock.patch(
24 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.tensorboard"
25 | )
26 | @absltest.mock.patch(
27 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.aiplatform"
28 | )
29 | def testWhenUploadToTensorboardThenVertexUploaderIsCalled(
30 | self,
31 | mock_aiplatform,
32 | mock_tensorboard,
33 | ):
34 | # given
35 |     mock_tensorboard.get_instance_identifiers.return_value = ["test-instance-id"]
36 | mock_tensorboard.get_experiment.return_value = "test-experiment"
37 |
38 | # when
39 | uploader.start_upload_to_tensorboard(
40 | "test-project",
41 | "us-central1",
42 | "test-experiment",
43 | "test-instance",
44 | "logdir",
45 | )
46 |
47 | # then
48 | mock_aiplatform.init.assert_called_once_with(
49 | project="test-project", location="us-central1"
50 | )
51 | mock_aiplatform.start_upload_tb_log.assert_called_once_with(
52 |         tensorboard_id="test-instance-id",
53 | tensorboard_experiment_name="test-experiment",
54 | logdir="logdir",
55 | )
56 |
57 | @absltest.mock.patch(
58 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.tensorboard"
59 | )
60 | @absltest.mock.patch(
61 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.aiplatform"
62 | )
63 | def testWhenNoTensorboardExistsThenVertexUploaderNotCalled(
64 | self,
65 | mock_aiplatform,
66 | mock_tensorboard,
67 | ):
68 | # given
69 | mock_tensorboard.get_instance_identifiers.return_value = []
70 |
71 | # when
72 | with self.assertLogs(level="ERROR") as log:
73 | uploader.start_upload_to_tensorboard(
74 | "test-project",
75 | "us-central1",
76 | "test-experiment",
77 | "test-instance",
78 | "logdir",
79 | )
80 |
81 | # then
82 | self.assertEqual(threading.active_count(), 1)
83 | self.assertRegex(
84 | log.output[0],
85 | "No Tensorboard instance with the name test-instance present in the"
86 | " project test-project.",
87 | )
88 | mock_aiplatform.init.assert_called_once_with(
89 | project="test-project", location="us-central1"
90 | )
91 | mock_aiplatform.start_upload_tb_log.assert_not_called()
92 |
93 | @absltest.mock.patch(
94 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.tensorboard"
95 | )
96 | @absltest.mock.patch(
97 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.aiplatform"
98 | )
99 | def testWhenNoExperimentExistsThenVertexUploaderNotCalled(
100 | self,
101 | mock_aiplatform,
102 | mock_tensorboard,
103 | ):
104 | # given
105 |     mock_tensorboard.get_instance_identifiers.return_value = ["test-instance-id"]
106 | mock_tensorboard.get_experiment.return_value = None
107 |
108 | # when
109 | with self.assertLogs(level="ERROR") as log:
110 | uploader.start_upload_to_tensorboard(
111 | "test-project",
112 | "us-central1",
113 | "test-experiment",
114 | "test-instance",
115 | "logdir",
116 | )
117 |
118 | # then
119 | self.assertRegex(
120 | log.output[0],
121 | "No Tensorboard experiment with the name test-experiment present in"
122 | " the project test-project.",
123 | )
124 | mock_aiplatform.init.assert_called_once_with(
125 | project="test-project", location="us-central1"
126 | )
127 | mock_aiplatform.start_upload_tb_log.assert_not_called()
128 |
129 | @absltest.mock.patch(
130 | "cloud_accelerator_diagnostics.pip_package.cloud_accelerator_diagnostics.src.tensorboard_uploader.uploader.aiplatform"
131 | )
132 | def testWhenStopUploadToTensorboardIsCalledThenVertexUploadIsStopped(
133 | self,
134 | mock_aiplatform,
135 | ):
136 | # when
137 | uploader.stop_upload_to_tensorboard()
138 |
139 | # then
140 | mock_aiplatform.end_upload_tb_log.assert_called_once()
141 |
142 |
143 | if __name__ == "__main__":
144 | absltest.main()
145 |
--------------------------------------------------------------------------------
/pip_package/pyproject.toml:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | [project]
16 | name = "cloud-accelerator-diagnostics"
17 | version = "0.1.1"
18 | authors = [
19 | { name="Cloud TPU Team", email="cloud-tpu-eng@google.com" },
20 | ]
21 | description = "Monitor, debug and profile the jobs running on Cloud accelerators like TPUs and GPUs."
22 | readme = "README.md"
23 | requires-python = ">=3.8"
24 | license = {text = "Apache-2.0"}
25 | classifiers = [
26 | "Programming Language :: Python :: 3.8",
27 | "Programming Language :: Python :: 3.9",
28 | "Programming Language :: Python :: 3.10",
29 | "Programming Language :: Python :: 3.11",
30 | ]
31 | keywords = []
32 |
33 | # pip dependencies installed with `pip install -e .`
34 | dependencies = [
35 | "google-cloud-aiplatform[tensorboard]"
36 | ]
37 |
38 | [project.urls]
39 | "Homepage" = "https://github.com/google/cloud-accelerator-diagnostics"
40 | "Bug Tracker" = "https://github.com/google/cloud-accelerator-diagnostics/issues"
41 |
42 | [build-system]
43 | # Build system specifies which backend is used to build/install the project
44 | requires = ["flit_core >=3.8,<4"]
45 | build-backend = "flit_core.buildapi"
46 |
47 | [tool.flit.sdist]
48 | # Flit specific options (files to exclude from the PyPI package)
49 | exclude = [
50 | # Do not release tests files on PyPI
51 | "tests/*_test.py",
52 | ]
53 |
--------------------------------------------------------------------------------
/tpu_info/README.md:
--------------------------------------------------------------------------------
1 |
16 | # `tpu-info` CLI
17 |
18 | `tpu-info` is a simple CLI tool for detecting Cloud TPU devices and reading
19 | runtime metrics from `libtpu`, including memory usage.
20 |
21 | Note: to access `libtpu` utilization metrics, you must have a workload running
22 | with a supported ML framework, such as JAX or PyTorch/XLA. See the
23 | [Usage](#usage) section for more information.
24 |
25 | ## Installing
26 |
27 | Install the latest release using `pip`:
28 |
29 | ```bash
30 | pip install tpu-info
31 | ```
32 |
33 | Alternatively, install `tpu-info` from source:
34 |
35 | ```bash
36 | pip install git+https://github.com/google/cloud-accelerator-diagnostics/#subdirectory=tpu_info
37 | ```
38 |
39 | ## Usage
40 |
41 | To view current TPU utilization data, `tpu-info` requires a running TPU workload
42 | with a supported ML framework[^1], such as JAX or PyTorch/XLA. For example:
43 |
44 | ```python
45 | # JAX
46 | >>> import jax
47 | >>> jax.device_count()
48 | 4
49 | # Create a tensor on the TPU
50 | >>> t = jax.numpy.ones((300, 300))
51 |
52 | # PyTorch/XLA
53 | >>> import torch
54 | >>> import torch_xla
55 | >>> t = torch.randn((300, 300), device=torch_xla.device())
56 | ```
57 |
58 | Then, on the same machine, run the `tpu-info` command line tool:
59 |
60 | ```bash
61 | $ tpu-info
62 | TPU Chips
63 | ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
64 | ┃ Chip ┃ Type ┃ Devices ┃ PID ┃
65 | ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
66 | │ /dev/accel0 │ TPU v4 chip │ 1 │ 130007 │
67 | │ /dev/accel1 │ TPU v4 chip │ 1 │ 130007 │
68 | │ /dev/accel2 │ TPU v4 chip │ 1 │ 130007 │
69 | │ /dev/accel3 │ TPU v4 chip │ 1 │ 130007 │
70 | └─────────────┴─────────────┴─────────┴────────┘
71 | Connected to libtpu at grpc://localhost:8431...
72 | TPU Utilization
73 | ┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
74 | ┃ Device ┃ Memory usage ┃ Duty cycle ┃
75 | ┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
76 | │ 0 │ 0.00 GiB / 31.75 GiB │ 0.00% │
77 | │ 1 │ 0.00 GiB / 31.75 GiB │ 0.00% │
78 | │ 2 │ 0.00 GiB / 31.75 GiB │ 0.00% │
79 | │ 3 │ 0.00 GiB / 31.75 GiB │ 0.00% │
80 | └────────┴──────────────────────┴────────────┘
81 | ```
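
The CLI also accepts a `--streaming` flag to refresh metrics continuously and a `--rate` flag to set the refresh interval in seconds, per the argument parser in `tpu_info/tpu_info/args.py`. For example:

```bash
# Refresh the metric tables every 2 seconds until interrupted.
tpu-info --streaming --rate 2.0
```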
82 |
83 | [^1]: Releases from before 2024 may not be compatible.
84 |
--------------------------------------------------------------------------------
/tpu_info/pyproject.toml:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | [build-system]
16 | requires = ["hatchling", "hatch-build-scripts", "grpcio-tools~=1.65.5"]
17 | build-backend = "hatchling.build"
18 |
19 | [project]
20 | version = "0.3.1"
21 | name = "tpu-info"
22 | dependencies = [
23 | # `grpcio` should match `grpcio-tools` build dependency
24 | "grpcio>=1.65.5",
25 | "protobuf",
26 | "rich",
27 | ]
28 | authors = [
29 | { name="Cloud TPU Team", email="cloud-tpu-eng@google.com" },
30 | ]
31 | description = "CLI tool to view TPU metrics"
32 | readme = "README.md"
33 | license = {text = "Apache-2.0"}
34 | requires-python = ">=3.8"
35 |
36 | [project.urls]
37 | homepage = "https://github.com/google/cloud-accelerator-diagnostics/tree/main/tpu_info"
38 | repository = "https://github.com/google/cloud-accelerator-diagnostics"
39 |
40 | [project.optional-dependencies]
41 | test = [
42 | "absl-py",
43 | ]
44 |
45 | [project.scripts]
46 | tpu-info = "tpu_info.cli:print_chip_info"
47 |
48 | [tool.hatch.build]
49 | exclude = [
50 | "*_test.py",
51 | "*.proto",
52 | ]
53 |
54 | [tool.hatch.build.targets.wheel]
55 | # HACK: Avoid copying files generated below
56 | # See https://github.com/rmorshea/hatch-build-scripts/discussions/4
57 | artifacts = [
58 | "tpu_metric_service_pb2.py",
59 | "tpu_metric_service_pb2.pyi",
60 | "tpu_metric_service_pb2_grpc.py",
61 | ]
62 |
63 | [[tool.hatch.build.hooks.build-scripts.scripts]]
64 | commands = [
65 | # Look up proto from current directory to ensure imports use `tpu_info`
66 | # package (e.g. `from tpu_info.proto import ...`)
67 | # See protoc bug: https://github.com/protocolbuffers/protobuf/issues/1491
68 | "python -m grpc_tools.protoc -I. tpu_info/proto/tpu_metric_service.proto --python_out=. --pyi_out=. --grpc_python_out=.",
69 | ]
70 | artifacts = []
71 | clean_artifacts = false
72 |
--------------------------------------------------------------------------------
/tpu_info/tpu_info/__init__.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | from . import cli
16 |
--------------------------------------------------------------------------------
/tpu_info/tpu_info/args.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Argument parsing for the tpu-info tool."""
16 |
17 | import argparse
18 |
19 |
20 | def parse_arguments():
21 | """Parses command line arguments for the tpu-info tool."""
22 | parser = argparse.ArgumentParser(
23 | description="Display TPU info and metrics.",
24 | formatter_class=argparse.RawTextHelpFormatter,
25 | )
26 | parser.add_argument(
27 | "--streaming",
28 | action="store_true",
29 | help="Enable streaming mode to refresh metrics continuously",
30 | )
31 | parser.add_argument(
32 | "--rate",
33 | type=float,
34 | default=1.0,
35 |       help=(
36 |           "Refresh rate in seconds for streaming mode (default: 1.0; only"
37 |           " used with --streaming)."
38 |       ),
39 | )
40 | return parser.parse_args()
41 |
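42 | # Illustrative invocations:
43 | #   tpu-info                          # one-shot snapshot of all tables
44 | #   tpu-info --streaming              # live view, refreshed every 1.0s
45 | #   tpu-info --streaming --rate 0.5   # refresh twice per second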
--------------------------------------------------------------------------------
/tpu_info/tpu_info/cli.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Defines command line interface for `tpu-info` tool.
16 |
17 | Top-level functions should be added to `project.scripts` in `pyproject.toml`.
18 | """
19 |
20 | import sys
21 | import time
22 | from typing import List
23 |
24 | from tpu_info import args
25 | from tpu_info import device
26 | from tpu_info import metrics
27 | import grpc
28 | from rich.console import Console, Group
29 | from rich.live import Live
30 | import rich.table
31 |
32 |
33 | def _bytes_to_gib(size: int) -> float:
34 | return size / (1 << 30)
35 |
36 |
37 | # TODO(vidishasethi): b/418938764 - Modularize by extracting
38 | # each table's rendering logic into its own dedicated helper function.
39 | def _fetch_and_render_tables(chip_type: device.TpuChip, count: int):
40 | """Fetches all TPU data and prepares a list of Rich Table objects for display."""
41 | renderables: List[rich.table.Table] = []
42 |
43 | table = rich.table.Table(title="TPU Chips", title_justify="left")
44 | table.add_column("Chip")
45 | table.add_column("Type")
46 | table.add_column("Devices")
47 | # TODO(wcromar): this may not match the libtpu runtime metrics
48 | # table.add_column("HBM (per core)")
49 | table.add_column("PID")
50 |
51 | chip_paths = [device.chip_path(chip_type, index) for index in range(count)]
52 | chip_owners = device.get_chip_owners()
53 |
54 | for chip in chip_paths:
55 | owner = chip_owners.get(chip)
56 |
57 | table.add_row(
58 | chip,
59 | str(chip_type),
60 | str(chip_type.value.devices_per_chip),
61 | str(owner),
62 | )
63 |
64 | renderables.append(table)
65 |
66 | table = rich.table.Table(
67 | title="TPU Runtime Utilization", title_justify="left"
68 | )
69 | table.add_column("Device")
70 | table.add_column("HBM usage")
71 | table.add_column("Duty cycle", justify="right")
72 |
73 |   # TODO(wcromar): take alternative ports as a flag
74 |   try:
75 |     device_usage = metrics.get_chip_usage(chip_type)
76 |     print("Connected to libtpu at grpc://localhost:8431...")
77 |   except grpc.RpcError as e:
78 |     if e.code() == grpc.StatusCode.UNAVAILABLE:  # pytype: disable=attribute-error
79 |       print(
80 |           "WARNING: libtpu metrics unavailable. Is a framework currently"
81 |           " using the TPU? See"
82 |           " https://github.com/google/cloud-accelerator-diagnostics/tree/main/tpu_info"
83 |           " for more information"
84 |       )
85 |     else:
86 |       print(f"ERROR: {e}")
87 |     # Fall back to placeholder values so the tables still render.
88 |     device_usage = [metrics.Usage(i, -1, -1, -1) for i in range(count)]
89 |
90 | for chip in device_usage:
91 | if chip.memory_usage < 0:
92 | memory_usage = "N/A"
93 | else:
94 | memory_usage = (
95 | f"{_bytes_to_gib(chip.memory_usage):.2f} GiB /"
96 | f" {_bytes_to_gib(chip.total_memory):.2f} GiB"
97 | )
98 | if chip.duty_cycle_pct < 0:
99 | duty_cycle_pct = "N/A"
100 | else:
101 | duty_cycle_pct = f"{chip.duty_cycle_pct:.2f}%"
102 | table.add_row(
103 | str(chip.device_id),
104 | memory_usage,
105 | duty_cycle_pct
106 | if chip_type.value.devices_per_chip == 1 or chip.device_id % 2 == 0
107 | else "",
108 | )
109 |
110 | renderables.append(table)
111 |
112 | table = rich.table.Table(title="TensorCore Utilization", title_justify="left")
113 | table.add_column("Chip ID")
114 | table.add_column("TensorCore Utilization", justify="right")
115 |
116 | try:
117 | # pylint: disable=g-import-not-at-top
118 | from libtpu import sdk # pytype: disable=import-error
119 |
120 | tensorcore_util_data = sdk.monitoring.get_metric("tensorcore_util").data()
121 |   except ImportError as e:
122 |     print(f"WARNING: libtpu SDK is not available: {e}.")
123 |   except AttributeError as e:
124 |     print(
125 |         f"WARNING: {e}. Please check that the latest libtpu is installed."
126 |     )
127 |   except RuntimeError as e:
128 |     print(
129 |         f"WARNING: {e}. Please check that the latest vbar control agent is used."
130 |     )
131 | else:
132 |     for i, utilization in enumerate(tensorcore_util_data):
133 |       tc_data = f"{utilization}%"
134 |       table.add_row(
135 |           str(i),
136 |           tc_data,
137 |       )
138 | renderables.append(table)
139 |
140 | table = rich.table.Table(
141 | title="TPU Buffer Transfer Latency", title_justify="left"
142 | )
143 | table.add_column("Buffer Size")
144 | table.add_column("P50", justify="right")
145 | table.add_column("P90", justify="right")
146 | table.add_column("P95", justify="right")
147 | table.add_column("P999", justify="right")
148 |
149 | try:
150 | buffer_transfer_latency_distributions = (
151 | metrics.get_buffer_transfer_latency()
152 | )
153 | except grpc.RpcError as e:
154 | if e.code() == grpc.StatusCode.UNAVAILABLE: # pytype: disable=attribute-error
155 | print(
156 | "WARNING: Buffer Transfer Latency metrics unavailable. Did you start"
157 | " a MULTI_SLICE workload with"
158 | " `TPU_RUNTIME_METRICS_PORTS=8431,8432,8433,8434`?"
159 | )
160 | else:
161 | print(f"ERROR: {e}")
162 |
163 | buffer_transfer_latency_distributions = []
164 |
165 | for distribution in buffer_transfer_latency_distributions:
166 | table.add_row(
167 | distribution.buffer_size,
168 | f"{distribution.p50:.2f} us",
169 | f"{distribution.p90:.2f} us",
170 | f"{distribution.p95:.2f} us",
171 | f"{distribution.p999:.2f} us",
172 | )
173 | renderables.append(table)
174 |
175 | return renderables
176 |
177 |
178 | def print_chip_info():
179 | """Print local TPU devices and libtpu runtime metrics."""
180 | cli_args = args.parse_arguments()
181 | # TODO(wcromar): Merge all of this info into one table
182 | chip_type, count = device.get_local_chips()
183 | if not chip_type:
184 | print("No TPU chips found.")
185 | return
186 |
187 | if cli_args.streaming:
188 | if cli_args.rate <= 0:
189 | print("Error: Refresh rate must be positive.", file=sys.stderr)
190 |       sys.exit(1)
191 |
192 | print(
193 | f"Starting streaming mode (refresh rate: {cli_args.rate}s). Press"
194 | " Ctrl+C to exit."
195 | )
196 |
197 | try:
198 | renderables = _fetch_and_render_tables(chip_type, count)
199 |
200 |       if not renderables:
201 | print(
202 | "No data tables could be generated. Exiting streaming.",
203 | file=sys.stderr,
204 | )
205 | return
206 |
207 | render_group = Group(*renderables)
208 |
209 | with Live(
210 | render_group,
211 | refresh_per_second=4,
212 | screen=True,
213 | vertical_overflow="visible",
214 | ) as live:
215 | while True:
216 | time.sleep(cli_args.rate)
217 | new_renderables = _fetch_and_render_tables(chip_type, count)
218 | live.update(Group(*new_renderables))
219 | except KeyboardInterrupt:
220 | print("\nExiting streaming mode.")
221 | except Exception as e:
222 | import traceback
223 |
224 | print(
225 | f"\nAn unexpected error occurred in streaming mode: {e}",
226 | file=sys.stderr,
227 | )
228 | traceback.print_exc(file=sys.stderr)
229 | sys.exit(1)
230 |
231 | else:
232 | renderables = _fetch_and_render_tables(chip_type, count)
233 |
234 | if renderables:
235 | console_obj = Console()
236 | for item in renderables:
237 | console_obj.print(item)
238 |
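239 | # Minimal sketch (illustrative; _fetch_and_render_tables is module-private,
240 | # not a stable API) of rendering one snapshot programmatically:
241 | #
242 | #   from rich.console import Console
243 | #   from tpu_info import cli, device
244 | #
245 | #   chip_type, count = device.get_local_chips()
246 | #   if chip_type:
247 | #       for table in cli._fetch_and_render_tables(chip_type, count):
248 | #           Console().print(table)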
--------------------------------------------------------------------------------
/tpu_info/tpu_info/device.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Utilities for detecting locally-attached TPU devices."""
16 |
17 | import collections
18 | import enum
19 | import glob
20 | import os
21 | import pathlib
22 | import re
23 | import typing
24 | from typing import Dict, Literal, Optional, Tuple
25 |
26 | GOOGLE_PCI_VENDOR_ID = "0x1ae0"
27 |
28 |
29 | class TpuChip(enum.Enum):
30 | """TPU chip versions and basic specs."""
31 |
32 | class Info(typing.NamedTuple):
33 | """Specs for a specific TPU chip version."""
34 |
35 | name: str
36 | hbm_gib: int
37 | devices_per_chip: Literal[1, 2]
38 |
39 | V2 = Info("v2", hbm_gib=8, devices_per_chip=2)
40 | V3 = Info("v3", hbm_gib=16, devices_per_chip=2)
41 | V4 = Info("v4", hbm_gib=32, devices_per_chip=1)
42 | V5E = Info("v5e", hbm_gib=16, devices_per_chip=1)
43 | V5P = Info("v5p", hbm_gib=95, devices_per_chip=1)
44 | V6E = Info("v6e", hbm_gib=32, devices_per_chip=1)
45 |
46 | @classmethod
47 | def from_pci_device_id(
48 | cls, device_id: str, subsystem_id: str
49 | ) -> Optional["TpuChip"]:
50 | """Returns TPU chip type for given PCI IDs, or None if not a TPU device."""
51 | # TPU v2 and v3 share a device ID
52 | if device_id == "0x0027":
53 | if subsystem_id == "0x004e":
54 | return cls.V2
55 | elif subsystem_id == "0x004f":
56 | return cls.V3
57 |
58 | device_id_to_device = {
59 | "0x005e": cls.V4,
60 | "0x0063": cls.V5E,
61 | "0x0062": cls.V5P,
62 | "0x006f": cls.V6E,
63 | }
64 |
65 | return device_id_to_device.get(device_id)
66 |
67 | def __str__(self):
68 | """Human-readable name of TPU chip type."""
69 | return f"TPU {self.value.name} chip"
70 |
71 |
72 | def get_local_chips() -> Tuple[Optional[TpuChip], int]:
73 | """Returns the type and number of TPU chips available."""
74 | count = collections.Counter()
75 | for pci_path in glob.glob("/sys/bus/pci/devices/*"):
76 | vendor_path = os.path.join(pci_path, "vendor")
77 | vendor_id = pathlib.Path(vendor_path).read_text().strip()
78 | if vendor_id != GOOGLE_PCI_VENDOR_ID:
79 | continue
80 |
81 | device_id_path = os.path.join(pci_path, "device")
82 | device_id = pathlib.Path(device_id_path).read_text().strip()
83 | subsystem_path = os.path.join(pci_path, "subsystem_device")
84 | subsystem_id = pathlib.Path(subsystem_path).read_text().strip()
85 |
86 | chip_type = TpuChip.from_pci_device_id(device_id, subsystem_id)
87 | if chip_type:
88 | count[chip_type] += 1
89 |
90 | assert len(count) <= 1, f"Expected one chip type, got {count}"
91 | return count.most_common()[0] if count else (None, 0)
92 |
93 |
94 | def chip_path(chip_type: TpuChip, index: int) -> str:
95 | """Returns the expected `/dev` path for a given TPU device type."""
96 | if chip_type in [TpuChip.V5E, TpuChip.V5P, TpuChip.V6E]:
97 | return f"/dev/vfio/{index}"
98 | else:
99 | return f"/dev/accel{index}"
100 |
101 |
102 | def get_chip_owners() -> Dict[str, int]:
103 | """Returns a mapping of device paths to PIDs of processes using that device."""
104 | device_owners = {}
105 |
106 | for link in glob.glob("/proc/*/fd/*"):
107 | try:
108 | file = os.readlink(link)
109 | except FileNotFoundError:
110 | continue
111 |
112 |     # e.g. /dev/accel0 or /dev/vfio/0
113 |     if re.fullmatch(r"/dev/(?:accel|vfio/)\d+", file):
114 | match = re.fullmatch(r"/proc/(\d+)/fd/\d+", link)
115 | if not match:
116 | raise RuntimeError("Unknown link pattern", link)
117 |
118 | device_owners[file] = int(match.group(1))
119 |
120 | return device_owners
121 |
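122 | # Illustrative flow on a hypothetical v4 host with 4 chips:
123 | #
124 | #   chip_type, count = get_local_chips()      # (TpuChip.V4, 4)
125 | #   [chip_path(chip_type, i) for i in range(count)]
126 | #   # -> ["/dev/accel0", "/dev/accel1", "/dev/accel2", "/dev/accel3"]
127 | #   get_chip_owners()                         # e.g. {"/dev/accel0": 1234}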
--------------------------------------------------------------------------------
/tpu_info/tpu_info/metrics.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Client library for libtpu runtime metrics."""
16 |
17 | import enum
18 | import itertools
19 | import typing
20 | from typing import List
21 |
22 | from tpu_info import device
23 | import grpc
24 |
25 | from tpu_info.proto import tpu_metric_service_pb2 as tpu_metrics
26 | from tpu_info.proto import tpu_metric_service_pb2_grpc as tpu_metrics_grpc
27 |
28 |
29 | class MetricName(enum.Enum):
30 | """Metric names defined in libtpu."""
31 |
32 | TOTAL_MEMORY = "tpu.runtime.hbm.memory.total.bytes"
33 | MEMORY_USAGE = "tpu.runtime.hbm.memory.usage.bytes"
34 | DUTY_CYCLE_PCT = "tpu.runtime.tensorcore.dutycycle.percent"
35 | BUFFER_TRANSFER_LATENCY_US = (
36 | "megascale.dcn_transfer_latencies.microsecond.cumulative.distribution"
37 | )
38 |
39 |
40 | class Usage(typing.NamedTuple):
41 | """Usage measurements for a TPU device."""
42 |
43 | device_id: int
44 | memory_usage: int
45 | total_memory: int
46 | duty_cycle_pct: float
47 |
48 |
49 | class BufferTransferLatencyDistribution(typing.NamedTuple):
50 | """Distribution measurements."""
51 |
52 | buffer_size: str
53 | p50: float
54 | p90: float
55 | p95: float
56 | p999: float
57 |
58 |
59 | def get_chip_usage(
60 | chip_type: device.TpuChip, addr: str = "localhost:8431"
61 | ) -> List[Usage]:
62 | """Gets usage statistics for all attached TPU devices.
63 |
64 | Args:
65 | chip_type: TPU chip version. Determines how metrics are interpreted.
66 | addr: GRPC address of libtpu metrics server.
67 |
68 | Returns:
69 | List of usage statistics for each TPU device.
70 | """
71 | channel = grpc.secure_channel(addr, grpc.local_channel_credentials())
72 | client = tpu_metrics_grpc.RuntimeMetricServiceStub(channel)
73 |
74 | def sorted_metric_response(
75 | metric_name: MetricName,
76 | ) -> List[tpu_metrics.Metric]:
77 | # Manually annotate type until GRPC supports annotations
78 | # See https://github.com/grpc/grpc/issues/29041
79 | resp: tpu_metrics.MetricResponse = client.GetRuntimeMetric(
80 | tpu_metrics.MetricRequest(metric_name=metric_name.value)
81 | )
82 | return sorted(resp.metric.metrics, key=lambda m: m.attribute.value.int_attr)
83 |
84 | totals = sorted_metric_response(MetricName.TOTAL_MEMORY)
85 | usages = sorted_metric_response(MetricName.MEMORY_USAGE)
86 | duty_cycle_pct = sorted_metric_response(MetricName.DUTY_CYCLE_PCT)
87 |
88 | # Duty cycle is always measured per-chip, while memory is measured per-core.
89 | # Repeat if necessary so these responses are the same length.
90 | duty_cycle_pct_per_core = list(
91 | itertools.chain.from_iterable(
92 | itertools.repeat(d, chip_type.value.devices_per_chip)
93 | for d in duty_cycle_pct
94 | )
95 | )
96 |
97 | assert (
98 | len(totals) == len(usages) == len(duty_cycle_pct_per_core)
99 | ), "Metrics not found for all chips"
100 |
101 | return [
102 | Usage(
103 | u.attribute.value.int_attr,
104 | u.gauge.as_int,
105 | t.gauge.as_int,
106 | d.gauge.as_double,
107 | )
108 | for u, t, d in zip(usages, totals, duty_cycle_pct_per_core)
109 | ]
110 |
111 |
112 | def _get_percentile(
113 | percentile_count: int,
114 | total_count: int,
115 | buckets: List[int],
116 | scale: float,
117 | growth_factor: float,
118 | ) -> float:
119 | """Gets a percentile value from a distribution."""
120 | for i in range(len(buckets) - 1, 0, -1):
121 | total_count -= buckets[i]
122 | if total_count <= percentile_count:
123 | delta = percentile_count - total_count
124 | lower_bound = scale * (growth_factor ** (i - 1))
125 | return lower_bound * (1 + (delta / buckets[i]) * (growth_factor - 1))
126 | return 1
127 |
128 |
129 | def get_buffer_transfer_latency(
130 | addr: str = "localhost:8431",
131 | ) -> List[BufferTransferLatencyDistribution]:
132 | """Gets buffer transfer latency statistics for all attached TPU devices.
133 |
134 | Args:
135 | addr: GRPC address of libtpu metrics server.
136 |
137 | Returns:
138 | List of buffer transfer latency statistics for each TPU device.
139 | """
140 | channel = grpc.secure_channel(addr, grpc.local_channel_credentials())
141 | client = tpu_metrics_grpc.RuntimeMetricServiceStub(channel)
142 |
143 | resp: tpu_metrics.MetricResponse = client.GetRuntimeMetric(
144 | tpu_metrics.MetricRequest(
145 | metric_name=MetricName.BUFFER_TRANSFER_LATENCY_US.value
146 | )
147 | )
148 |
149 | buffer_transfer_latency_distributions = []
150 |
151 | for metric in resp.metric.metrics:
152 | attribute = metric.attribute
153 | distribution = metric.distribution
154 | bucket = list(distribution.bucket_counts)
155 | count = distribution.count
156 | scale = distribution.bucket_options.exponential_buckets.scale
157 | growth_factor = (
158 | distribution.bucket_options.exponential_buckets.growth_factor
159 | )
160 |
161 | p50_count = int(count * 0.5)
162 | p90_count = int(count * 0.9)
163 | p95_count = int(count * 0.95)
164 | p999_count = int(count * 0.999)
165 |
166 | p50 = _get_percentile(p50_count, count, bucket, scale, growth_factor)
167 | p90 = _get_percentile(p90_count, count, bucket, scale, growth_factor)
168 | p95 = _get_percentile(p95_count, count, bucket, scale, growth_factor)
169 | p999 = _get_percentile(p999_count, count, bucket, scale, growth_factor)
170 |
171 | buffer_transfer_latency_distributions.append(
172 | BufferTransferLatencyDistribution(
173 | attribute.value.kvlist_attr.attributes[0].value.string_attr,
174 | p50,
175 | p90,
176 | p95,
177 | p999,
178 | )
179 | )
180 |
181 | return buffer_transfer_latency_distributions
182 |
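183 | # Illustrative notes, using made-up values:
184 | #
185 | # - get_chip_usage repeats per-chip duty cycle metrics per core: on v3
186 | #   (devices_per_chip=2), [60.0, 70.0] becomes [60.0, 60.0, 70.0, 70.0]
187 | #   to align with the per-core memory metrics.
188 | # - _get_percentile with exponential buckets (scale=1, growth_factor=2,
189 | #   buckets=[0, 10, 10], total_count=20): the p50 count is 10; scanning
190 | #   down from the top bucket leaves total_count=10 <= 10, so delta=0 and
191 | #   p50 = 1 * 2**(2 - 1) = 2.0.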
--------------------------------------------------------------------------------
/tpu_info/tpu_info/proto/__init__.py:
--------------------------------------------------------------------------------
1 | # Copyright 2023 Google LLC
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # https://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 |
15 | """Contains generated proto definitions for libtpu metrics service."""
16 |
--------------------------------------------------------------------------------
/tpu_info/tpu_info/proto/tpu_metric_service.proto:
--------------------------------------------------------------------------------
1 | syntax = "proto3";
2 |
3 | package tpu.monitoring.runtime;
4 |
5 | import "google/protobuf/timestamp.proto";
6 |
7 |
8 |
9 | option java_multiple_files = true;
10 | option objc_class_prefix = "GRPC";
11 | option java_package = "com.google.tpu.monitoring.runtime.service.proto";
12 |
13 |
14 | message Exemplar {
15 | double value = 1;
16 | .google.protobuf.Timestamp timestamp = 2;
17 | repeated Attribute attributes = 3;
18 | }
19 |
20 | message Distribution {
21 | int64 count = 1;
22 | double mean = 2;
23 | double min = 3;
24 | double max = 4;
25 | double sum_of_squared_deviation = 5;
26 |
27 | message BucketOptions {
28 | oneof options {
29 | Regular regular_buckets = 1 [deprecated = true];
30 | Exponential exponential_buckets = 2;
31 | Explicit explicit_buckets = 3;
32 | Linear linear_buckets = 4;
33 | }
34 | message Regular {
35 | option deprecated = true;
36 |
37 | int32 num_finite_buckets = 1;
38 | // A linear distribution has only one bound with overall width and offset
39 | // of the lowest bucket.
40 | // An explicit distribution will have monotonically increasing buckets
41 | // with width and the offset from the previous bucket.
42 | repeated Bound bounds = 2;
43 | }
44 | message Exponential {
45 | // Must be greater than 0.
46 | int32 num_finite_buckets = 1;
47 | // Must be greater than 1.
48 | double growth_factor = 2;
49 | // Must be greater than 0.
50 | double scale = 3;
51 | }
52 | message Bound {
53 | option deprecated = true;
54 |
55 | double width = 1;
56 | double offset = 2;
57 | }
58 |
59 | // Specifies a linear sequence of buckets that all have the same width
60 | // (except overflow and underflow). Each bucket represents a constant
61 | // absolute uncertainty on the specific value in the bucket.
62 | //
63 | // There are `num_finite_buckets + 2` (= N) buckets. Bucket `i` has the
64 | // following boundaries:
65 | //
66 | // Upper bound (0 <= i < N-1): offset + (width * i).
67 | //
68 | // Lower bound (1 <= i < N): offset + (width * (i - 1)).
69 | message Linear {
70 | // Must be greater than 0.
71 | int32 num_finite_buckets = 1;
72 |
73 | // Must be greater than 0.
74 | double width = 2;
75 |
76 | // Lower bound of the first bucket.
77 | double offset = 3;
78 | }
79 |
80 | // Specifies a set of buckets with arbitrary widths.
81 | //
82 | // There are `size(bounds) + 1` (= N) buckets. Bucket `i` has the following
83 | // boundaries:
84 | //
85 | // Upper bound (0 <= i < N-1): bounds[i]
86 |   //   Lower bound (1 <= i < N): bounds[i - 1]
87 | //
88 | // The `bounds` field must contain at least one element. If `bounds` has
89 | // only one element, then there are no finite buckets, and that single
90 | // element is the common boundary of the overflow and underflow buckets.
91 | message Explicit {
92 | // The values must be monotonically increasing.
93 | repeated double bounds = 1;
94 | }
95 | }
96 |
97 | // Defines the histogram bucket boundaries.
98 | BucketOptions bucket_options = 6;
99 | repeated int64 bucket_counts = 7;
100 | repeated Exemplar exemplars = 8;
101 | }
102 |
103 | // Gauge represents a single-point measure.
104 | message Gauge {
105 | oneof value {
106 | double as_double = 1;
107 | int64 as_int = 2;
108 | string as_string = 3;
109 | bool as_bool = 4;
110 | }
111 | }
112 |
113 | // Counter is a monotonically increasing measure (until reset to zero).
114 | message Counter {
115 |   // The value MUST NOT be negative.
116 | oneof value {
117 | double as_double = 1;
118 | uint64 as_int = 2;
119 | }
120 | Exemplar exemplar = 3;
121 | }
122 |
123 | // Quantile represents the value at a given quantile of a distribution.
124 | message Quantile {
125 | // The quantile of a distribution. Must be in the interval [0.0, 1.0].
126 | double quantile = 1;
127 | // The value at the given quantile of a distribution.
128 | // Quantile values must NOT be negative.
129 | double value = 2;
130 | }
131 |
132 | // Summary represents observed sampling for different quantiles including
133 | // sum of all the observations and total count of observations.
134 | message Summary {
135 | uint64 sample_count = 1;
136 | double sample_sum = 2;
137 | repeated Quantile quantile = 3;
138 | }
139 |
140 | // AttrValue represents an attribute value.
141 | // AttrValue is considered to be "empty" if all values are unspecified.
142 | message AttrValue {
143 | oneof attr {
144 | string string_attr = 1;
145 | bool bool_attr = 2;
146 | int64 int_attr = 3;
147 | double double_attr = 4;
148 | ArrayAttrValue array_attr = 5;
149 | KeyValueList kvlist_attr = 6;
150 | bytes bytes_attr = 7;
151 | }
152 | }
153 |
154 | // ArrayAttrValue is a list of AttrValue messages.
155 | message ArrayAttrValue {
156 |   // Array of attributes. The array may be empty (contain 0 elements).
157 | repeated AttrValue attrs = 1;
158 | }
159 |
160 | // KeyValueList is a list of Key-AttrValue messages.
161 | message KeyValueList {
162 | // A collection of key/value attributes. The list may be empty.
163 | // The keys in attributes MUST be unique.
164 | repeated Attribute attributes = 1;
165 | }
166 |
167 | // Attribute is a key-value pair to store the attributes of a metric.
168 | // For example, the device-id or host-id of the metric.
169 | message Attribute {
170 | string key = 1;
171 | AttrValue value = 2;
172 | }
173 |
174 | // Metric represents a metric datapoint.
175 | // A metric has a reporting time, attribute and a measure value.
176 | message Metric {
177 | Attribute attribute = 1;
178 | .google.protobuf.Timestamp timestamp = 2;
179 | oneof measure {
180 | Gauge gauge = 3;
181 | Counter counter = 4;
182 | Distribution distribution = 5;
183 | Summary summary = 6;
184 | }
185 | }
186 |
187 | // TPUMetric is a standalone metric object, exposed externally to a consumer.
188 | message TPUMetric {
189 | string name = 1;
190 | string description = 2;
191 | repeated Metric metrics = 3;
192 | }
193 |
194 | // MetricRequest is the request object to fetch metrics from LibTPU.
195 | // MetricRequest contains the metric name with which metrics can be fetched
196 | // from the RuntimeMetricsService.GetRuntimeMetric.
197 | message MetricRequest {
198 | string metric_name = 1;
199 | // skip_node_aggregation provides options to the client to skip aggregated
200 | // lookup of metrics for a worker node. If the field is unset or set as false,
201 | // an aggregated view of metrics for a TPU worker node would be provided.
202 | // The aggregation feature is enabled by libTPU during initialization.
203 | // By default, the worker node aggregation would be turned on in libTPU if the
204 | // metrics server is supported. If the libTPU initialization turns off the
205 | // feature explicitly, then the aggregated view would not be provided.
206 | bool skip_node_aggregation = 2;
207 | }
208 |
209 | // MetricResponse is the response object for RuntimeService.GetRuntimeMetric.
210 | // The response contains the TPUMetric as response which holds the metric data
211 | // for the requested metric.
212 | message MetricResponse {
213 | TPUMetric metric = 1;
214 | }
215 |
216 | // ListSupportedMetricsRequest is the request object for
217 | // RuntimeService.ListSupportedMetrics.
218 | // Empty request means no filters. All the metrics supported from the LibTPU
219 | // would be returned as the response.
220 | message ListSupportedMetricsRequest {
221 | // A regex filter to apply to the supported metrics.
222 | // If the field is empty or not set, no filter is applied. All the supported
223 | // metrics are returned.
224 | //
225 | // Example: `.*memory.*`, `.*memory.*|.*duty_cycle.*`
226 | string filter = 1;
227 | }
228 |
229 | message SupportedMetric {
230 | string metric_name = 1;
231 | }
232 |
233 | // ListSupportedMetricsResponse is the response object for
234 | // RuntimeService.ListSupportedMetrics.
235 | // It contains all the metrics supported in the LibTPU for the
236 | // ListSupportedMetricsRequest.
237 | message ListSupportedMetricsResponse {
238 |   // List of supported metrics.
239 | repeated SupportedMetric supported_metric = 1;
240 | }
241 |
242 | service RuntimeMetricService {
243 | // GetRuntimeMetric returns the TPU metrics data for the MetricRequest.
244 | rpc GetRuntimeMetric(MetricRequest) returns (MetricResponse);
245 |
246 | // ListSupportedMetrics lists the supported metrics for
247 | // ListSupportedMetricsRequest.
248 | rpc ListSupportedMetrics(ListSupportedMetricsRequest)
249 | returns (ListSupportedMetricsResponse);
250 | }
251 |
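252 | // Illustrative exchange (field values are hypothetical):
253 | //   request:  MetricRequest{metric_name: "tpu.runtime.hbm.memory.usage.bytes"}
254 | //   response: MetricResponse{metric: TPUMetric{metrics: [
255 | //     Metric{attribute: {key: "device-id", value: {int_attr: 0}},
256 | //            gauge: {as_int: 4294967296}}]}}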
--------------------------------------------------------------------------------