├── reps ├── README.md ├── user-story.png ├── 2025-11-24-ray-token-auth │ ├── ray_auth_flow.png │ └── ray_auth_with_k8s_architecture.png ├── 2023-12-04-accelerated-dag-figures │ ├── image1.png │ ├── image2.png │ ├── image3.png │ ├── image4.png │ └── image5.png ├── 2024-10-18-train-tune-api-revamp │ ├── train_architecture.png │ ├── train_tune_decoupled.png │ ├── train_tune_dependency.png │ ├── train_tune_interop_after.png │ └── train_tune_interop_before.png ├── 2025-11-21-ray-history-server │ ├── events_file_structure.png │ ├── history_server_architecture.png │ └── 2025-11-21-ray-history-server.md ├── 2024-05-21-ray-kubectl-plugin.md ├── 2023-04-27-data-strict-mode.md ├── 2022-03-09-shuffle.md ├── 2024-06-16-support-apache-yunikorn-scheduler.md ├── 2022-03-08-serve_pipeline.md ├── 2024-05-21-kuberay-authentication.md ├── 2023-07-08-air-surface-syntax.md ├── 2023-03-15-train-api.md ├── 2022-12-08-ray-for-federated-learning-and-privacy-preserving-computing.md ├── 2022-10-11-serve-java-http-ingress.md ├── 2023-04-28-remove-algorithms-from-rllib.md ├── 2023-03-20-air-storage-path.md ├── 2022-09-19-ray-on-spark.md ├── 2023-08-18-serve-java-dag-api.md ├── 2023-8-18-ray-on-spark-autoscaling.md ├── 2023-06-06-simplify-sync.md ├── 2022-04-21-state-observability-apis.md ├── 2025-03-18-label-based-scheduling.md └── 2023-10-13-accelerator-support.md ├── README.md ├── .gitignore └── LICENSE /reps/README.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /reps/user-story.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/user-story.png -------------------------------------------------------------------------------- /reps/2025-11-24-ray-token-auth/ray_auth_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2025-11-24-ray-token-auth/ray_auth_flow.png -------------------------------------------------------------------------------- /reps/2023-12-04-accelerated-dag-figures/image1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2023-12-04-accelerated-dag-figures/image1.png -------------------------------------------------------------------------------- /reps/2023-12-04-accelerated-dag-figures/image2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2023-12-04-accelerated-dag-figures/image2.png -------------------------------------------------------------------------------- /reps/2023-12-04-accelerated-dag-figures/image3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2023-12-04-accelerated-dag-figures/image3.png -------------------------------------------------------------------------------- /reps/2023-12-04-accelerated-dag-figures/image4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2023-12-04-accelerated-dag-figures/image4.png -------------------------------------------------------------------------------- /reps/2023-12-04-accelerated-dag-figures/image5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2023-12-04-accelerated-dag-figures/image5.png -------------------------------------------------------------------------------- /reps/2024-10-18-train-tune-api-revamp/train_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2024-10-18-train-tune-api-revamp/train_architecture.png -------------------------------------------------------------------------------- /reps/2025-11-21-ray-history-server/events_file_structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2025-11-21-ray-history-server/events_file_structure.png -------------------------------------------------------------------------------- /reps/2024-10-18-train-tune-api-revamp/train_tune_decoupled.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2024-10-18-train-tune-api-revamp/train_tune_decoupled.png -------------------------------------------------------------------------------- /reps/2024-10-18-train-tune-api-revamp/train_tune_dependency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2024-10-18-train-tune-api-revamp/train_tune_dependency.png -------------------------------------------------------------------------------- /reps/2024-10-18-train-tune-api-revamp/train_tune_interop_after.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2024-10-18-train-tune-api-revamp/train_tune_interop_after.png -------------------------------------------------------------------------------- /reps/2025-11-21-ray-history-server/history_server_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2025-11-21-ray-history-server/history_server_architecture.png -------------------------------------------------------------------------------- /reps/2025-11-24-ray-token-auth/ray_auth_with_k8s_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2025-11-24-ray-token-auth/ray_auth_with_k8s_architecture.png -------------------------------------------------------------------------------- /reps/2024-10-18-train-tune-api-revamp/train_tune_interop_before.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ray-project/enhancements/HEAD/reps/2024-10-18-train-tune-api-revamp/train_tune_interop_before.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Ray Enhancement Proposals 2 | This repo tracks Ray Enhancement Proposals (REPs). The REP process is the main way to propose, discuss, and decide on features and other major changes to the Ray project. We'll start with a simple decision-making process (and evolve it over time):
3 | - First, a draft PR is created against the repo with a draft REP. A senior Ray committer should be designated as the shepherd in the Stewardship section and assigned to the PR. 4 | - The shepherd will review the PR and get it into a polished state for further review by Ray committers. 5 | - Once the PR is reviewable, we will hold a vote on the ``ray-committers`` mailing list. In most cases this should reach consensus; if the result is not unanimous, Eric Liang (@ericl) and Philipp Moritz (@pcmoritz) will be the final deciders on whether to accept the change. 6 | - Based on the results of the vote and possible final decision, the PR will either be merged (REP approved) or closed (REP rejected) with a short summary of the decision. 7 | 8 | You can find a list of PRs for REPs here (both open and merged PRs are available for comment): https://github.com/ray-project/enhancements/pulls?q=is%3Apr 9 | 10 | Each REP should include the following information: 11 | ## Summary 12 | ### General Motivation 13 | What use cases is this proposal supposed to enhance? If possible, please include details like the environment and scale. 14 | ### Should this change be within `ray` or outside? 15 | From a software layering perspective, should this change be part of the main `ray` project, part of an ecosystem project under `ray-project`, or a new ecosystem project? 16 | 17 | When reviewing the REP, the reviewers and the shepherd should apply the following judgements: 18 | - If an author proposes a change to be within the `ray` repo, the reviewers and the shepherd should assess whether the change can be layered on top of `ray` instead. 19 | If so, we should try to make the change in a separate repo. 20 | - For a change proposed as an ecosystem project under `ray-project`: the reviewers and the shepherd should make sure that the technical quality 21 | meets the bar of (at least) a good "experimental" or "alpha" feature -- we should be comfortable welcoming Ray users with similar use cases to try this project. 22 | - For a change proposed as a new ecosystem project (outside of `ray-project`): this REP is just serving as a "request for comments". 23 | We don't need to go through the voting process, since it's not Ray committers' decision to approve the change. 24 | 25 | ## Stewardship 26 | ### Required Reviewers 27 | The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers. 28 | ### Shepherd of the Proposal (should be a senior committer) 29 | To make the review process more productive, the owner of each proposal should identify a **shepherd** (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review. 30 | 31 | ## Design and Architecture 32 | The proposal should include sufficient technical details for reviewers to determine the anticipated benefits and risks. 33 | 34 | ## Compatibility, Deprecation, and Migration Plan 35 | An important part of the proposal is to explicitly point out any compatibility implications of the proposed change. If there are any, we should thoroughly discuss a plan to deprecate existing APIs and migrate to the new one(s). 36 | 37 | ## Test Plan and Acceptance Criteria 38 | The proposal should discuss how the change will be tested **before** it can be merged or enabled.
It should also include other acceptance criteria including documentation and examples. 39 | 40 | ## (Optional) Follow-on Work 41 | Optionally, the proposal should discuss necessary follow-on work after the change is accepted. 42 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Core latex/pdflatex auxiliary files: 2 | *.aux 3 | *.lof 4 | *.log 5 | *.lot 6 | *.fls 7 | *.out 8 | *.toc 9 | *.fmt 10 | *.fot 11 | *.cb 12 | *.cb2 13 | .*.lb 14 | 15 | ## Intermediate documents: 16 | *.dvi 17 | *.xdv 18 | *-converted-to.* 19 | # these rules might exclude image files for figures etc. 20 | # *.ps 21 | # *.eps 22 | # *.pdf 23 | 24 | ## Generated if empty string is given at "Please type another file name for output:" 25 | .pdf 26 | 27 | ## Bibliography auxiliary files (bibtex/biblatex/biber): 28 | *.bbl 29 | *.bcf 30 | *.blg 31 | *-blx.aux 32 | *-blx.bib 33 | *.run.xml 34 | 35 | ## Build tool auxiliary files: 36 | *.fdb_latexmk 37 | *.synctex 38 | *.synctex(busy) 39 | *.synctex.gz 40 | *.synctex.gz(busy) 41 | *.pdfsync 42 | 43 | ## Build tool directories for auxiliary files 44 | # latexrun 45 | latex.out/ 46 | 47 | ## Auxiliary and intermediate files from other packages: 48 | # algorithms 49 | *.alg 50 | *.loa 51 | 52 | # achemso 53 | acs-*.bib 54 | 55 | # amsthm 56 | *.thm 57 | 58 | # beamer 59 | *.nav 60 | *.pre 61 | *.snm 62 | *.vrb 63 | 64 | # changes 65 | *.soc 66 | 67 | # comment 68 | *.cut 69 | 70 | # cprotect 71 | *.cpt 72 | 73 | # elsarticle (documentclass of Elsevier journals) 74 | *.spl 75 | 76 | # endnotes 77 | *.ent 78 | 79 | # fixme 80 | *.lox 81 | 82 | # feynmf/feynmp 83 | *.mf 84 | *.mp 85 | *.t[1-9] 86 | *.t[1-9][0-9] 87 | *.tfm 88 | 89 | #(r)(e)ledmac/(r)(e)ledpar 90 | *.end 91 | *.?end 92 | *.[1-9] 93 | *.[1-9][0-9] 94 | *.[1-9][0-9][0-9] 95 | *.[1-9]R 96 | *.[1-9][0-9]R 97 | *.[1-9][0-9][0-9]R 98 | *.eledsec[1-9] 99 | *.eledsec[1-9]R 100 | *.eledsec[1-9][0-9] 101 | *.eledsec[1-9][0-9]R 102 | *.eledsec[1-9][0-9][0-9] 103 | *.eledsec[1-9][0-9][0-9]R 104 | 105 | # glossaries 106 | *.acn 107 | *.acr 108 | *.glg 109 | *.glo 110 | *.gls 111 | *.glsdefs 112 | *.lzo 113 | *.lzs 114 | 115 | # uncomment this for glossaries-extra (will ignore makeindex's style files!) 
116 | # *.ist 117 | 118 | # gnuplottex 119 | *-gnuplottex-* 120 | 121 | # gregoriotex 122 | *.gaux 123 | *.gtex 124 | 125 | # htlatex 126 | *.4ct 127 | *.4tc 128 | *.idv 129 | *.lg 130 | *.trc 131 | *.xref 132 | 133 | # hyperref 134 | *.brf 135 | 136 | # knitr 137 | *-concordance.tex 138 | # TODO Comment the next line if you want to keep your tikz graphics files 139 | *.tikz 140 | *-tikzDictionary 141 | 142 | # listings 143 | *.lol 144 | 145 | # luatexja-ruby 146 | *.ltjruby 147 | 148 | # makeidx 149 | *.idx 150 | *.ilg 151 | *.ind 152 | 153 | # minitoc 154 | *.maf 155 | *.mlf 156 | *.mlt 157 | *.mtc[0-9]* 158 | *.slf[0-9]* 159 | *.slt[0-9]* 160 | *.stc[0-9]* 161 | 162 | # minted 163 | _minted* 164 | *.pyg 165 | 166 | # morewrites 167 | *.mw 168 | 169 | # nomencl 170 | *.nlg 171 | *.nlo 172 | *.nls 173 | 174 | # pax 175 | *.pax 176 | 177 | # pdfpcnotes 178 | *.pdfpc 179 | 180 | # sagetex 181 | *.sagetex.sage 182 | *.sagetex.py 183 | *.sagetex.scmd 184 | 185 | # scrwfile 186 | *.wrt 187 | 188 | # sympy 189 | *.sout 190 | *.sympy 191 | sympy-plots-for-*.tex/ 192 | 193 | # pdfcomment 194 | *.upa 195 | *.upb 196 | 197 | # pythontex 198 | *.pytxcode 199 | pythontex-files-*/ 200 | 201 | # tcolorbox 202 | *.listing 203 | 204 | # thmtools 205 | *.loe 206 | 207 | # TikZ & PGF 208 | *.dpth 209 | *.md5 210 | *.auxlock 211 | 212 | # todonotes 213 | *.tdo 214 | 215 | # vhistory 216 | *.hst 217 | *.ver 218 | 219 | # easy-todo 220 | *.lod 221 | 222 | # xcolor 223 | *.xcp 224 | 225 | # xmpincl 226 | *.xmpi 227 | 228 | # xindy 229 | *.xdy 230 | 231 | # xypic precompiled matrices and outlines 232 | *.xyc 233 | *.xyd 234 | 235 | # endfloat 236 | *.ttt 237 | *.fff 238 | 239 | # Latexian 240 | TSWLatexianTemp* 241 | 242 | ## Editors: 243 | # WinEdt 244 | *.bak 245 | *.sav 246 | 247 | # Texpad 248 | .texpadtmp 249 | 250 | # LyX 251 | *.lyx~ 252 | 253 | # Kile 254 | *.backup 255 | 256 | # gummi 257 | .*.swp 258 | 259 | # KBibTeX 260 | *~[0-9]* 261 | 262 | # TeXnicCenter 263 | *.tps 264 | 265 | # auto folder when using emacs and auctex 266 | ./auto/* 267 | *.el 268 | 269 | # expex forward references with \gathertags 270 | *-tags.tex 271 | 272 | # standalone packages 273 | *.sta 274 | 275 | # Makeindex log files 276 | *.lpz 277 | 278 | # IDE 279 | .idea/* 280 | -------------------------------------------------------------------------------- /reps/2024-05-21-ray-kubectl-plugin.md: -------------------------------------------------------------------------------- 1 | # Ray Kubectl Plugin 2 | 3 | This document introduces a kubectl plugin designed to enhance the interaction with KubeRay resources, simplifying the experience of utilizing Ray on Kubernetes. 4 | The primary objective of this plugin is to provide a user-friendly interface, enabling users to leverage the benefits of Ray on Kubernetes without needing extensive knowledge of Kubernetes concepts and tools. 5 | 6 | ## Motivation 7 | 8 | Today it is incredibly challenging for data scientists and AI researchers unfamiliar with Kubernetes to start using Ray on Kubernetes. 9 | However, running Ray on Kubernetes is advantageous or necessary in certain environments. These users struggle with KubeRay for a variety of reasons: 10 | 1. **Unfamiliarity with Kubernetes API and Best Practices**: Users new to Kubernetes may struggle to operate Ray clusters using Kubernetes concepts like Pods and Services due to unfamiliarity with the Kubernetes API and best practices. 11 | 2. **Complex Kubernetes Networking**: Kubernetes networking can be daunting for beginners. 
Understanding how to use kubectl port-forward or externally exposed Services to connect to Ray clusters can be challenging without prior experience. 12 | 3. **Construction of Kubernetes YAML Manifests**: Creating YAML manifests for RayCluster, RayJob, and RayService can be challenging for those not experienced with using Kubernetes custom resources. 13 | 14 | For novice Kubernetes users, an intuitive CLI experience can drastically enhance their onboarding journey. A kubectl plugin is proposed since it can seamlessly integrate with the rest of the existing kubectl surface as needed. 15 | The primary goal of this plugin should be user-friendliness and a smooth progression from zero to “something good enough”. For more advanced scenarios and scaling requirements, users can opt to manage KubeRay custom resources 16 | independently as many users do today. 17 | 18 | More simply, the Ray Kubectl Plugin simplifies the management of Ray clusters by eliminating the need for users to handle complex YAML configurations. 19 | 20 | Why a kubectl plugin instead of the existing KubeRay CLI? 21 | * **Convenience / ease of use**: Kubectl plugins are designed to work within the existing kubectl framework. They inherit the user's current authentication, context, and cluster metadata, eliminating the need for additional setup. 22 | * **Seamless kubectl integration**: Users can effortlessly switch between basic kubectl subcommands and the Ray plugin using a single command-line tool. 23 | * **Accessibility**: The KubeRay CLI needs the KubeRay API server to function. However, the majority of clusters using KubeRay don't run the KubeRay API server, which limits the accessibility of the CLI. 24 | 25 | ## User Journey 26 | 27 | ### Create / Manage a Ray cluster 28 | 29 | ``` 30 | $ kubectl ray cluster create my-cluster --ray-version=2.9.3 31 | --worker-replicas=10 --worker-cpus=8 --worker-memory=32GB --worker-gpus=2 32 | ``` 33 | 34 | ``` 35 | kubectl ray cluster scale default-worker-group --cluster my-cluster --replicas=10 36 | ``` 37 | 38 | ``` 39 | kubectl ray cluster worker-group create cpu-pool --cluster my-cluster --worker-replicas=10 40 | --worker-cpus=8 --worker-memory=32GB 41 | 42 | ``` 43 | 44 | ``` 45 | kubectl ray cluster delete my-cluster 46 | ``` 47 | 48 | ### Ray Job Submissions (using RayJob) 49 | 50 | ``` 51 | $ kubectl ray job submit --cluster my-cluster --image=image:tag -- python job.py 52 | ``` 53 | 54 | ### Ray Session - Interactive Client / Dashboard 55 | 56 | ``` 57 | $ kubectl ray cluster session my-cluster 58 | Connecting... 59 | Ray Interactive Client Session started on port 10001... 60 | Ray Dashboard Session started on port 8265... 61 | ``` 62 | 63 | ### Ray Dashboard 64 | 65 | ``` 66 | $ kubectl ray cluster dashboard my-cluster 67 | Opening browser session to Ray dashboard ... 68 | ``` 69 | 70 | ### Ray Logs 71 | 72 | ``` 73 | $ kubectl ray cluster logs my-cluster --out-dir ./log-dir 74 | Downloading Ray logs to ./log-dir ... 75 | ``` 76 | 77 | ## Implementation Details 78 | 79 | The kubectl plugin will be developed in the `cli` directory, replacing the current KubeRay CLI. While the kubectl plugin will overlap with the existing KubeRay CLI in some ways (especially in managing KubeRay clusters), it will go further by enhancing day-to-day operations with Ray clusters. These enhancements include authenticating to clusters, establishing local sessions, submitting jobs, and scaling clusters.
In addition, the kubectl plugin will not depend on the KubeRay API server, making it viable for a larger audience of KubeRay users. 80 | 81 | The CLI will extend kubectl using kubectl’s plugin mechanism. See [Extend kubectl with plugins](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/) for more details. 82 | 83 | MVP Scope: 84 | * `kubectl ray cluster get|list` 85 | * `kubectl ray cluster scale` 86 | * `kubectl ray cluster session` 87 | * `kubectl ray cluster dashboard` 88 | * `kubectl ray cluster logs` 89 | 90 | Future Scope: 91 | * `kubectl ray cluster create|update|delete` 92 | * `kubectl ray cluster create --provider=gke|eks|aks|etc` 93 | * Support for adding provider-specific YAML like GCSFuse/EFS mounts, load balancers, etc. 94 | * `kubectl ray job get|list` 95 | * `kubectl ray job submit` 96 | -------------------------------------------------------------------------------- /reps/2023-04-27-data-strict-mode.md: -------------------------------------------------------------------------------- 1 | # Roll out "strict mode" for Ray Data 2 | 3 | ## Summary 4 | 5 | Make a (breaking) API change to always require data schemas in Ray Data, dropping support for standalone Python objects. In addition to unification and simplicity benefits, this aligns the Ray Data API closer to industry-standard distributed data APIs like Apache Spark and also emerging standards for machine learning datasets like HuggingFace. 6 | 7 | ### General Motivation 8 | 9 | This REP proposes rolling out a breaking API change to Ray Data, termed "strict mode". In strict mode, support for standalone Python objects is dropped. This means that instead of directly storing, e.g., a Python `Tuple[str, int]` instance in Ray Data, users will have to either give each field a name (i.e., `{foo: str, bar: int}`), or use a named object-type field (i.e., `{foo: object}`). In addition, strict mode removes the "default" batch format, making "numpy" the default. This means that most users just need to be aware of `Dict[str, Any]` (non-batched data records) and `Dict[str, np.ndarray]` (batched data) types when working with Ray Data. 10 | 11 | The motivation for this change is to cut down on the number of alternative representations users have to be aware of in Ray Data, which complicate the docs and examples and add to new-user confusion. 12 | For reference, this is the main PR originally introducing strict mode: https://github.com/ray-project/ray/pull/34336 13 | 14 | **Full list of changes** 15 | - All read APIs return structured data, never standalone Python objects. 16 | - Standalone Python objects are prohibited from being returned from map / map batches. 17 | - Standalone Numpy arrays are prohibited from being returned from map / map batches. 18 | - There is no more special interpretation of a single-column schema containing just `__value__` as a column. 19 | - The default batch format is "numpy" instead of "default" (pandas). 20 | - schema() returns a unified Schema class instead of Union[pyarrow.lib.Schema, type]. 21 | 22 | **Datasource behavior changes** 23 | - `range_tensor`: create "data" col instead of `__value__` 24 | - `from_numpy`/`from_numpy_refs`: create "data" col instead of using `__value__` 25 | - `from_items`: create "item" col instead of using Python objects 26 | - `range`: create "id" column instead of using Python objects 27 | 28 | The change itself has been well received in user testing, so the remainder of this REP will focus on the rollout strategy.
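To make the datasource behavior changes above concrete, here is a small sketch (outputs are shown as comments and the exact reprs may differ slightly):

```python
import numpy as np
import ray

# Tensor datasources now produce a named "data" column instead of the
# special `__value__` column.
ds = ray.data.from_numpy(np.ones((5, 3)))
print(ds.take(1))  # -> [{"data": array([1., 1., 1.])}]

# Standalone items are wrapped in an "item" column, and `range` produces
# an "id" column, so every record is a dict of named fields.
print(ray.data.from_items([1, 2, 3]).take(1))  # -> [{"item": 1}]
print(ray.data.range(3).take(1))               # -> [{"id": 0}]
```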
29 | 30 | ### Should this change be within `ray` or outside? 31 | main `ray` project. Changes are made to Ray Data. 32 | 33 | ## Stewardship 34 | ### Required Reviewers 35 | The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers. 36 | 37 | @amogkam, @c21 38 | 39 | ### Shepherd of the Proposal (should be a senior committer) 40 | To make the review process more productive, the owner of each proposal should identify a **shepherd** (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review. 41 | 42 | @pcmoritz 43 | 44 | ## Rollout Plan 45 | 46 | ### Impact of Changes 47 | 48 | The proposed change mainly impacts users who are working with in-memory data objects and image datasets. These users will get an error when trying to load data without a schema (e.g., ``StrictModeError: Error validating : standalone Python objects are not allowed in strict mode. Please wrap the item in a dictionary like `{data: }`. For more details and how to disable strict mode, visit DOC_URL_HERE.``). 49 | 50 | ### Notification 51 | 52 | The main method of notification will be the ``StrictModeError`` exception raised when the user tries to create disallowed data types. The exception will link to documentation on how to upgrade / disable strict mode. 53 | 54 | We will also add a warning banner (for a couple releases) on the first import of Ray Data that notifies users of this change. 55 | 56 | ### Timeline 57 | 58 | - Ray 2.5: Enable strict mode by default, with the above notification plan. 59 | - Ray 2.6: No changes. 60 | - Ray 2.7 or after: Enforce strict mode always, and remove code for supporting the legacy code paths. 61 | 62 | ## Examples: 63 | 64 | ### Before 65 | ```python 66 | ds = ray.data.range(5) 67 | # -> Datastream(num_blocks=1, schema=) 68 | 69 | ds.take()[0] 70 | # -> 0 71 | 72 | ds.take_batch() 73 | # -> [0, 1, 2, 3, 4] 74 | 75 | ds.map_batches(lambda b: b * 2).take_batch() # b is coerced to pd.DataFrame 76 | # -> pd.DataFrame({"id": [0, 2, 4, 6, 8]}) 77 | ``` 78 | 79 | ### After 80 | ```python 81 | ds = ray.data.range(5) 82 | # -> Datastream(num_blocks=1, schema={id: int64}) 83 | 84 | ds.take()[0] 85 | # -> {"id": 0} 86 | 87 | ds.take_batch() 88 | # -> {"id": np.array([0, 1, 2, 3, 4])} 89 | 90 | ds.map_batches(lambda b: {"id": b["id"] * 2}).take_batch() # b is Dict[str, np.ndarray] 91 | # -> {"id": np.array([0, 2, 4, 6, 8])} 92 | ``` 93 | 94 | Note that in the "after" code, the datastream always has a fixed schema, and the batch type is consistently a dict of numpy arrays. 95 | 96 | ## Test Plan and Acceptance Criteria 97 | 98 | The master branch will have strict mode on by default. There will be a suite that tests basic functionality with strict mode off, to avoid regressions. 99 | -------------------------------------------------------------------------------- /reps/2022-03-09-shuffle.md: -------------------------------------------------------------------------------- 1 | ## Summary 2 | ### General Motivation 3 | 4 | Shuffle is an important primitive for data processing (e.g., sort) and ML ingest workloads (e.g., shuffling training data).
5 | Currently, Ray offers shuffle through the Datasets library, but the implementation has known data scalability and performance limitations past 10TB data scales. 6 | The goal of this proposal is to improve Datasets shuffle stability and scalability, through both Datasets and Ray core improvements. 7 | By the end of this work, we hope to achieve petabyte-scale shuffle operations with Datasets. 8 | 9 | ### Should this change be within `ray` or outside? 10 | 11 | These changes would lie within Ray Datasets, to improve the shuffle algorithm, and within Ray core, to improve the shuffle execution. 12 | 13 | ## Stewardship 14 | ### Required Reviewers 15 | 16 | @stephanie-wang, @ericl, @scv119 17 | 18 | ### Shepherd of the Proposal (should be a senior committer) 19 | 20 | @ericl 21 | 22 | ## Design and Architecture 23 | 24 | Currently Datasets shuffle is implemented as a map and reduce stage of Ray tasks. For example: 25 | 26 | ```python 27 | @ray.remote 28 | def map(partition, split_fn): 29 | return [block for block in split_fn(partition)] 30 | 31 | @ray.remote 32 | def reduce(*partitions): 33 | return merge(partitions) 34 | 35 | map_outputs = [map.options(num_returns=num_reducers).remote(partition, split_fn) for partition in partitions] 36 | reduce_outputs = [reduce.remote(*[map_output[i] for map_output in map_outputs]) for i in range(num_reducers)] 37 | ray.get(reduce_outputs) 38 | ``` 39 | 40 | This forms a task graph that looks something like this: 41 | 42 | ![MapReduce](https://miro.medium.com/max/680/1*nJYIs2ktVkqVsgSUCzfjaA.gif) 43 | 44 | The number of intermediate map outputs increases quadratically with the number of partitions and therefore the dataset size. This has two primary problems: 45 | 1. I/O efficiency worsens as the data size gets larger and the size of each intermediate map output gets smaller. 46 | 2. The system overhead of each map output becomes a scalability bottleneck. 47 | 48 | We propose addressing (1) through improvements in the Datasets library and (2) through Ray core improvements. 49 | 50 | ### Ray Datasets 51 | 52 | To improve I/O efficiency, we will incorporate some of the work done in the [Exoshuffle paper](https://arxiv.org/abs/2203.05072) on improving shuffle performance on Ray. 53 | Exoshuffle implements a "push-based shuffle" using Ray tasks, which pushes map outputs directly to their reducer while the map stage is still executing. 54 | More details on push-based shuffle can be found in the [Magnet paper](https://dl.acm.org/doi/10.14778/3415478.3415558), which implemented the algorithm as a part of Spark. 55 | 56 | In the current Datasets shuffle, the metadata overhead for `ObjectRefs` in Ray becomes a bottleneck at around 10TB or more. 57 | The Exoshuffle work reduces the total number of `ObjectRefs` and showed that it is possible to run sort at 100TB or more data scales on Ray. 58 | To incorporate this work, we will: 59 | 1. Unify Datasets shuffle-based primitives on the same internal shuffle implementation. 60 | 2. Benchmark Datasets shuffle-based primitives. 61 | 3. Implement a push-based shuffle in Datasets. 62 | 63 | ### Ray core 64 | 65 | Although push-based shuffle reduces the amount of system metadata, it is not enough to scale to petabyte-size data. 66 | Thus, we also propose a number of Ray core improvements to reduce the total amount of metadata needed during shuffle operations. 67 | These improvements center around reducing the per-`ObjectRef` metadata needed. 
68 | Currently, each task requires about 5KB of metadata and each object requires about 2KB of metadata at the driver. 69 | We plan to reduce these by: 70 | 71 | 1. "Collapsing" `ObjectRef` metadata for objects returned by the same task. All shuffle-on-Ray implementations rely on tasks with multiple return values. Currently, the metadata for these objects is stored separately, but we can combine these to amortize the metadata cost. 72 | 2. "Collapsing" task metadata for tasks submitted in the same stage. This is analogous to the above, but for tasks that are "similar". 73 | 3. Optimizing per-task and per-object metadata. Many of the metadata fields are not actually set or often have the same values. We can potentially save memory by not including these fields. 74 | 75 | #### Fault tolerance and scheduling support 76 | 77 | The push-based shuffle implemented in Exoshuffle currently requires precise placement of each task to reduce data movement across the cluster. 78 | Currently, this is implemented using node-specific resource requirements. 79 | However, this will hang the job if one of those nodes fails. 80 | 81 | To ensure fault tolerance, we plan to implement soft scheduling constraints to allow tasks to execute even if their original node fails. 82 | In the future, we can improve this further by providing a higher-level API to express scheduling constraints for groups of tasks. 83 | 84 | ## Compatibility, Deprecation, and Migration Plan 85 | 86 | This proposal will not include any API changes other than a way to choose the shuffle implementation in Datasets. 87 | 88 | ## Test Plan and Acceptance Criteria 89 | The proposal should discuss how the change will be tested **before** it can be merged or enabled. It should also include other acceptance criteria including documentation and examples. 90 | 91 | We plan to benchmark Datasets using the following workloads: 92 | 1. Shuffling data loader for ML ingest. 93 | 2. Sorting. 94 | 3. Groupby, using the [h2oai benchmark](https://h2oai.github.io/db-benchmark/). 95 | 96 | We will also use chaos testing (random node failures) to check fault tolerance. 97 | 98 | Initially, we will test the Datasets sort and compare performance both to Exoshuffle and the theoretical best (based on disk bandwidth specs). 99 | We will also test scalability to confirm the current shuffle bottleneck in Datasets and determine the amount of driver memory needed to support a petabyte-scale sort. 100 | 101 | Acceptance criteria: 102 | * Shuffling data loader for ML ingest can scale to 100TB or more (requires push-based shuffle in Datasets). 103 | * Ray Datasets sort can scale to petabyte-size data (requires Ray core optimizations). 104 | * Ray includes a shuffle workload with millions of partitions in the nightly CI tests. 
105 | 106 | ## (Optional) Follow-on Work 107 | 108 | Some of the follow-up work includes: 109 | * Further optimizations for groupby and other end-to-end Datasets benchmarks 110 | * Improving scalability to larger clusters (100s or 1000s of nodes) 111 | * Providing a high-level API to express scheduling constraints between groups of tasks 112 | * Further optimizations to reduce object overhead in Ray core (e.g., multi-part objects) 113 | -------------------------------------------------------------------------------- /reps/2024-06-16-support-apache-yunikorn-scheduler.md: -------------------------------------------------------------------------------- 1 | # Support new batch scheduler option: Apache YuniKorn 2 | 3 | ## Motivation 4 | 5 | The [Support batch scheduling and queueing](https://github.com/ray-project/kuberay/issues/213) proposal allows Ray to integrate 6 | easily with non-default Kubernetes schedulers, such as [Volcano](https://volcano.sh/). This proposal aims to add support for 7 | another popular Kubernetes scheduler: [Apache YuniKorn](https://yunikorn.apache.org/). This provides Ray users with 8 | another option to leverage the [scheduling features](https://yunikorn.apache.org/docs/next/get_started/core_features) 9 | YuniKorn offers. 10 | 11 | ## Existing state 12 | 13 | ### Current Batch Scheduler Option 14 | 15 | Currently, the batch scheduler feature can be enabled using the `--enable-batch-scheduler` boolean option. 16 | When set to `true`, the operator is started with the following Helm chart value override: 17 | 18 | ```shell 19 | --set batchScheduler.enabled=true 20 | ``` 21 | 22 | The scheduler manager in the KubeRay operator will initialize the scheduler plugins. The framework provides hooks for 23 | each scheduler implementation to inject custom resources and modify pod metadata accordingly. When a Ray cluster is 24 | created, the framework calls the appropriate scheduler plugin functions based on the scheduler name provided 25 | by `ray.io/scheduler-name`. 26 | 27 | ### Limitations 28 | 29 | The framework is designed to be scheduler-agnostic and provides general hooks for supporting different scheduler options. 30 | However, once the `--enable-batch-scheduler` option is set to `true`, the scheduler manager will attempt to load all 31 | scheduler schemas by calling the `AddToScheme(scheme *runtime.Scheme)` function implemented by each scheduler plugin. 32 | This loads all the schemas, including CRDs defined in the implementation. Since only the `VolcanoBatchScheduler` 33 | is currently implemented, it always loads Volcano's `PodGroup` CRD in the controller runtime, 34 | requiring the installation of Volcano CRDs even when enabling other scheduler options. 35 | 36 | ## Proposed changes 37 | 38 | ### Targeted scheduler option 39 | 40 | To support Apache YuniKorn and, more importantly, other schedulers in Ray, 41 | we propose adding an option to explicitly set the scheduler name, i.e., `--batch-scheduler`. 42 | Option syntax: `--batch-scheduler [SupportedSchedulerName]`, for example: 43 | 44 | 45 | 46 | ```shell 47 | # use volcano 48 | --batch-scheduler=volcano 49 | 50 | # use yunikorn 51 | --batch-scheduler=yunikorn 52 | ``` 53 | 54 | When a scheduler name is specified, the scheduler manager will only load the configured scheduler plugin, 55 | ensuring only necessary resources are loaded in the controller runtime.
56 | 57 | The option `--batch-scheduler` accepts a single scheduler name as the value, and the value must be `volcano` 58 | or `yunikorn` (until other scheduler plugins are introduced). If an unrecognized scheduler name is provided, 59 | the controller will fail to start with an error indicating that the scheduler plugin is not found. 60 | 61 | ### Deprecate "ray.io/scheduler-name" 62 | 63 | With the scheduler name specified in the operator startup options, there is no need to set `ray.io/scheduler-name` 64 | in `RayJob` or `RayCluster` CRs. This label should be marked as deprecated and eventually removed. 65 | 66 | ### Compatibility 67 | 68 | To maintain backwards compatibility, the `--enable-batch-scheduler` option will remain supported for a few more 69 | releases. However, it will be marked as deprecated, and users are encouraged to switch to the new option. 70 | Below are the scenarios for how these options can be used: 71 | 72 | ```shell 73 | # case 1: 74 | # use volcano 75 | --enable-batch-scheduler=true 76 | 77 | # case 2: 78 | # use volcano 79 | --batch-scheduler=volcano 80 | 81 | # case 3: 82 | # use yunikorn 83 | --batch-scheduler=yunikorn 84 | 85 | # case 4: 86 | # invalid options, error message: do not use two options together 87 | # for simplicity, only one of these 2 options should be used 88 | --enable-batch-scheduler=true 89 | --batch-scheduler=volcano|yunikorn 90 | ``` 91 | 92 | ### YuniKorn scheduler plugin behavior 93 | 94 | The YuniKorn scheduler plugin will support both `RayCluster` and `RayJob` resources. The integration will be lightweight, 95 | as YuniKorn does not require new CRDs. The plugin will add labels to the Ray pods, and the YuniKorn scheduler will 96 | schedule Ray pods based on these labels. 97 | 98 | To enable the YuniKorn scheduler, set the following options when starting the KubeRay operator: 99 | 100 | ```shell 101 | --batch-scheduler=yunikorn 102 | ``` 103 | 104 | If submitting a `RayCluster`, add `yunikorn.apache.org/queue-name` and `yunikorn.apache.org/application-id` to the labels. 105 | 106 | ```yaml 107 | apiVersion: ray.io/v1 108 | kind: RayCluster 109 | metadata: 110 | labels: 111 | yunikorn.apache.org/queue-name: root.abc 112 | yunikorn.apache.org/application-id: rayjob-sample-ltpjh 113 | ``` 114 | 115 | The `RayCluster` will be submitted to the `root.abc` queue and scheduled by the YuniKorn scheduler. The `RayCluster` will be 116 | recognized as an "application" with ID "rayjob-sample-ltpjh".
117 | 118 | If submitting a `RayJob`, provide only the queue name: 119 | 120 | ```yaml 121 | apiVersion: ray.io/v1 122 | kind: RayJob 123 | metadata: 124 | name: rayjob-example 125 | namespace: my-namespace 126 | labels: 127 | yunikorn.apache.org/queue-name: root.abc 128 | ``` 129 | 130 | When the Ray job is submitted to the cluster, the KubeRay operator will create the following `RayCluster` CR: 131 | 132 | ```yaml 133 | apiVersion: ray.io/v1 134 | kind: RayCluster 135 | metadata: 136 | labels: 137 | # RayCluster inherits the labels from RayJob 138 | yunikorn.apache.org/queue-name: root.abc 139 | # the same job ID defined in the RayJob spec, or generated by the controller 140 | yunikorn.apache.org/application-id: rayjob-sample-ltpjh 141 | ``` 142 | 143 | ### YuniKorn scheduler plugin details 144 | 145 | The YuniKorn scheduler plugin looks for relevant labels in the RayCluster and propagates the following labels 146 | to all the pods created by the RayCluster CR: 147 | 148 | ```yaml 149 | apiVersion: v1 150 | kind: Pod 151 | metadata: 152 | labels: 153 | app.kubernetes.io/created-by: kuberay-operator 154 | app.kubernetes.io/name: kuberay 155 | # value is taken from RayCluster CR label: "yunikorn.apache.org/application-id" 156 | applicationId: rayjob-sample-ltpjh 157 | # value is taken from RayCluster CR label: "yunikorn.apache.org/queue-name" 158 | queue: root.abc 159 | spec: 160 | schedulerName: yunikorn 161 | ``` 162 | Details about the meaning of these labels can be found in this 163 | [doc](https://yunikorn.apache.org/docs/user_guide/labels_and_annotations_in_yunikorn#labels-and-annotations-in-yunikorn). 164 | YuniKorn will recognize all these pods as part of the same application "rayjob-sample-ltpjh" and apply the 165 | scheduling features accordingly. 166 | 167 | 168 | -------------------------------------------------------------------------------- /reps/2022-03-08-serve_pipeline.md: -------------------------------------------------------------------------------- 1 | 2 | ## Summary - Serve Pipeline 3 | ### General Motivation 4 | Production machine learning serving pipelines are getting longer and wider. They often consist of multiple, or even tens of, models collectively making a final prediction, such as image / video content classification and tagging, fraud detection pipelines with multiple policies and models, multi-stage ranking and recommendation, etc. 5 | 6 | Meanwhile, model sizes are also growing beyond the memory limit of a single machine due to exponentially growing parameter counts (e.g., GPT-3, sparse feature embeddings in recsys models), so the ability to do disaggregated and distributed inference is desirable and future-proof. 7 | 8 | We want to leverage the programmable and general-purpose distributed computing ability of Ray, double down on its unique strengths (scheduling, communication and shared memory) to facilitate authoring, orchestrating, scaling and deployment of complex serving pipelines under one set of DAG APIs, so a user can program and test multiple models or multiple shards of a single large model dynamically, deploy to production at scale, and upgrade individually. 9 | #### Key requirements: 10 | - Provide the ability to author a DAG of serve nodes to form a complex inference graph. 11 | - The pipeline authoring experience should be fully Python programmable with support for dynamic selection, control flows, user business logic, etc.
12 | - The DAG can be instantiated and locally executed using the tasks and actors API 13 | - The DAG can be deployed via a declarative and idempotent API; individual nodes can be reconfigured and scaled independently. 14 | 15 | ### Should this change be within `ray` or outside? 16 | main `ray` project. Changes are made to Ray Core and Ray Serve level. 17 | 18 | ## Stewardship 19 | ### Required Reviewers 20 | The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers. 21 | 22 | @ericl, @edoakes, @simon-mo, @jiaodong 23 | 24 | ### Shepherd of the Proposal (should be a senior committer) 25 | To make the review process more productive, the owner of each proposal should identify a **shepherd** (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review. 26 | 27 | @ericl 28 | 29 | ## Design and Architecture 30 | 31 | ### Example - Diagram 32 | 33 | We want to author a simple diamond-shaped DAG where user-provided input is sent to two models (m1, m2) that each access part of (or the identical) input, and part of the original input is also forwarded to the final ensemble stage to compute the final output. 34 | 35 | m1.forward(dag_input[0]) 36 | / \ 37 | dag_input ----- dag_input[2] ------ ensemble -> dag_output 38 | \ / 39 | m2.forward(dag_input[1]) 40 | 41 | 42 | ### Example - Code 43 | 44 | Classes or functions decorated by Ray can be used directly in Ray DAG building. 45 | ```python 46 | @ray.remote 47 | class Model: 48 | def __init__(self, val): 49 | self.val = val 50 | def forward(self, input): 51 | return self.val * input 52 | 53 | @ray.remote 54 | def ensemble(a, b, c): 55 | return a + b + c 56 | 57 | async def request_to_data_int(request: starlette.requests.Request): 58 | data = await request.body() 59 | return int(data) 60 | 61 | # Args binding, DAG building and input preprocessor definition 62 | with ServeInputNode(preprocessor=request_to_data_int) as dag_input: 63 | m1 = Model.bind(1) 64 | m2 = Model.bind(2) 65 | m1_output = m1.forward.bind(dag_input[0]) 66 | m2_output = m2.forward.bind(dag_input[1]) 67 | ray_dag = ensemble.bind(m1_output, m2_output, dag_input[2]) 68 | ``` 69 | 70 | A DAG authored with the Ray DAG API should be locally executable by the Ray Core runtime alone. 71 | 72 | ```python 73 | # 1*1 + 2*2 + 3 74 | assert ray.get(ray_dag.execute(1, 2, 3)) == 8 75 | ``` 76 | 77 | A Ray DAG can be built into a `serve application` that contains all nodes needed. 78 | ```python 79 | # Build, configure and deploy 80 | app = serve.pipeline.build(ray_dag) 81 | ``` 82 | 83 | Configure individual deployments in the app, using the same variable names as in `ray_dag`. 84 | ```python 85 | app.m1.set_options(num_replicas=3) 86 | app.m2.set_options(num_replicas=5) 87 | ``` 88 | 89 | We reserve the name and generate a serve `ingress` deployment that takes care of HTTP / gRPC, input schema validation, adaptation, etc. It's our Python interface to configure pipeline ingress.
```python 91 | app.ingress.set_options(num_replicas=10) 92 | 93 | # Translates to group_deploy behind the scenes 94 | app_handle = app.deploy() 95 | 96 | # Serve App is locally executable 97 | assert ray.get(app_handle.remote(1, 2, 3)) == 8 98 | ``` 99 | 100 | A serve pipeline application can be built into a YAML file for structured deployment, and is configurable by the Ops team by directly mutating configurable fields without deep knowledge or involvement of model code in the pipeline. 101 | ```python 102 | deployment.yaml = app.to_yaml() 103 | 104 | # Structured deployment CLI 105 | serve deploy deployment.yaml 106 | ``` 107 | 108 | ## Compatibility, Deprecation, and Migration Plan 109 | An important part of the proposal is to explicitly point out any compatibility implications of the proposed change. If there are any, we should thoroughly discuss a plan to deprecate existing APIs and migrate to the new one(s). 110 | 111 | - Ray Core 112 | - Serve Pipeline is co-designed with Ray Unified DAG API, where each DAG is always authored using Ray DAG API first. 113 | - The only new API introduced is the `.bind()` method on Ray-decorated functions or classes. 114 | - Ray Serve 115 | - Serve Pipeline DAG is transformed from Ray DAG where classes used are replaced with serve `Deployment` and class instances with deployment's `RayServeHandle` for better compatibility, deprecation as well as migration. 116 | 117 | - Breaking Changes: Ray Serve 118 | - All args and kwargs passed into a class or function in Serve Pipeline need to be JSON serializable, enforced upon the `build()` call. 119 | - We need to introduce and abstract out an `Ingress` component for serve pipeline. 120 | 121 | - Deprecation 122 | - Existing Serve Pipeline Alpha API will be deprecated in favor of Ray Unified DAG API as well as Serve Pipeline Beta. 123 | 124 | - Migration Plan: Ray Serve 125 | - New concepts and API introduced will be applied to Serve Pipeline Beta launch first to minimize compatibility risks. We can expect existing deployment implementations will migrate to `Ingress` and `Serve App` APIs later on. 126 | - Existing multi-model pipelines using the Alpha API or raw deployment handles are expected to be migrated to the Pipeline Beta API over time. 127 | 128 | 129 | ## Test Plan and Acceptance Criteria 130 | The proposal should discuss how the change will be tested **before** it can be merged or enabled. It should also include other acceptance criteria including documentation and examples. 131 | 132 | - Unit and integration tests for core components 133 | - Benchmarks on common multi-model inference workloads 134 | - Documentation with representative workloads, covered by CI. 135 | 136 | ## (Optional) Follow-on Work 137 | - Performance optimizations for multi-model inference, such as communication, multiplexing, scale-to-zero, etc. 138 | - UX and UI improvements for better user experience 139 | - Exploration of large model Distributed Inference on Serve Pipeline where each node represents a shard of a large model. 140 | -------------------------------------------------------------------------------- /reps/2024-05-21-kuberay-authentication.md: -------------------------------------------------------------------------------- 1 | # Kubernetes Native Ray Authentication 2 | 3 | Ray, in its default configuration, lacks robust security and authentication measures. 4 | Its deployment on Kubernetes provides some level of protection by leveraging RBAC for access control of RayCluster resources.
5 | However, once provisioned, a RayCluster remains vulnerable to unauthorized access from anyone with network connectivity to the Ray head node. 6 | 7 | This proposal introduces a Kubernetes aware sidecar proxy for authenticating external access to the Ray head. 8 | It will leverage the existing Kubernetes authentication system to grant tokens used to securely access Ray head endpoints. 9 | This simplifies Ray authentication by centralizing management of tokens and users within Kubernetes. 10 | 11 | ## Authentication Scope 12 | 13 | While there is a large surface area of a Ray cluster that could benefit from authenticated access, this proposal focuses specifically on securing Ray head endpoints that are frequently accessed externally. 14 | This includes enforcing authentication for both the dashboard server and the interactive client server. Securing access from internal clients like the Raylet to the GCS server is not addressed in this proposal 15 | as network policies are typically sufficient to protect this communication. 16 | 17 | ## Sidecar Authentication Proxy 18 | 19 | To enforce authenticated access to specific Ray head ports, the KubeRay operator will deploy a sidecar container alongside the Ray head pod. 20 | This sidecar will function as a reverse proxy, validating authentication tokens for requests to the dashboard port (8265) and the interactive client port (10001). 21 | Traffic to other ports will pass through the sidecar proxy without requiring authentication. 22 | 23 | The KubeRay operator will be changed in the following ways: 24 | 1. The Ray head container will bind to localhost only 25 | 2. A reverse proxy sidecar will run alongside the Ray head container 26 | 3. A ServiceAccount is created per RayCluster resource 27 | 28 | The sidecar ensures only requests with the correct authorization tokens can access the Ray head. More details on how tokens are authenticated below. 29 | For the MVP, the sidecar will be implemented using Envoy configured with an [External Authorizer](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter). 30 | 31 | ## Access Control with TokenReview API 32 | 33 | The sidecar authenticator will use the TokenReview API to authenticate tokens passed into the authorization headers of each request. 34 | Tokens can be created from Kubernetes ServiceAccounts or an external identity provider as long as the Kubernetes API Server is configured with an external authorizer. 35 | Tokens from Kubernetes Service Accounts are required to specify a unique audience for the cluster. For now the placeholder audience is "ray.io/cluster/". 36 | Each RayCluster will be provisioned a default ServiceAccount that KubeRay Operator will use when authenticating to the Ray head (specifically for RayJob and RayService). 37 | Users can use the default ServiceAccount in the absence of external identity providers, but using external identity providers is strongly recommended. 
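For illustration only, the token check the sidecar performs boils down to something like the following sketch, written here with the official Kubernetes Python client (the actual sidecar is the Envoy external authorizer described above, and the function name is hypothetical):

```python
from kubernetes import client, config

def authenticate_request_token(token: str, cluster_audience: str) -> bool:
    """Validate a bearer token against the Kubernetes TokenReview API."""
    config.load_incluster_config()  # the proxy runs inside the Ray head pod
    review = client.V1TokenReview(
        spec=client.V1TokenReviewSpec(token=token, audiences=[cluster_audience])
    )
    result = client.AuthenticationV1Api().create_token_review(review)
    status = result.status
    # Accept only tokens that authenticated and were minted for this
    # cluster's audience (the "ray.io/cluster/..." placeholder above).
    return bool(status.authenticated) and cluster_audience in (status.audiences or [])
```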
38 | 39 | Example of a TokenReview request that contains a token: 40 | ``` 41 | apiVersion: authentication.k8s.io/v1 42 | kind: TokenReview 43 | spec: 44 | token: 45 | audiences: 46 | - "ray.io/cluster/" 47 | ``` 48 | 49 | Example of a TokenReview response with user information about the token: 50 | ``` 51 | apiVersion: authentication.k8s.io/v1 52 | kind: TokenReview 53 | status: 54 | authenticated: true 55 | user: 56 | username: system:serviceaccount:default:ray-cluster 57 | uid: 58 | groups: ["system:serviceaccounts", "system:serviceaccounts:default"] 59 | audiences: 60 | - "ray.io/cluster/" 61 | ``` 62 | 63 | ## User to Ray Head Authentication 64 | 65 | When submitting jobs using the Ray job CLI: 66 | ``` 67 | export RAY_JOB_HEADERS="Authorization: Bearer $(kubectl create token ray-cluster)" 68 | OR 69 | export RAY_JOB_HEADERS="Authorization: Bearer $(gcloud auth print-access-token)" # example external identity provider 70 | 71 | ray job submit... 72 | ``` 73 | 74 | When using the Ray interactive client: 75 | ``` 76 | export RAY_CLUSTER_TOKEN=$(kubectl create token ray-cluster) 77 | OR 78 | export RAY_CLUSTER_TOKEN=$(gcloud auth print-access-token) # example external identity provider 79 | ------------------------------------------------------------ 80 | 81 | import os 82 | import ray 83 | 84 | def get_metadata(): 85 | """ 86 | Get authentication header metadata for ray.util.connect 87 | """ 88 | headers = {"Authorization": f"Bearer {os.environ['RAY_CLUSTER_TOKEN']}"} 89 | return [(key.lower(), value) for key, value in headers.items()] 90 | 91 | 92 | ray.init( 93 | "ray-cluster:443", 94 | metadata=get_metadata(), 95 | secure=True, 96 | ) 97 | ``` 98 | 99 | ## RayCluster API Changes 100 | 101 | Enabling authentication should be optional per RayCluster. Two new fields `enableAuthentication` and `authenticationOptions` will be added to the RayCluster spec to enable this capability and configure allowed principals. 102 | If no allowed principals are specified, it will default to a ServiceAccount with the same namespace and name as the RayCluster. 103 | 104 | ``` 105 | apiVersion: ray.io/v1 106 | kind: RayCluster 107 | metadata: 108 | ... 109 | spec: 110 | enableAuthentication: true 111 | authenticationOptions: 112 | allowedPrincipals: 113 | - my-user@example.com # example user 114 | - system:serviceaccount:default:ray-cluster # example service account 115 | - system:authenticated # example group 116 | ``` 117 | 118 | ## Dynamic Access Control with Kubernetes RBAC and the SubjectAccessReview API 119 | 120 | Beyond token authentication via the TokenReview API, Kubernetes Role-Based Access Control (RBAC) provides a dynamic mechanism to manage which principals (users or service accounts) 121 | have access to Ray clusters. This can be achieved by introducing a custom verb and leveraging the SubjectAccessReview API. 122 | 123 | A custom verb `admin` can be used in Roles that reference the `rayclusters` resource: 124 | ``` 125 | apiVersion: rbac.authorization.k8s.io/v1 126 | kind: Role 127 | metadata: 128 | name: ray-admins 129 | namespace: my-team 130 | rules: 131 | - apiGroups: ["ray.io"] 132 | resources: ["rayclusters"] 133 | verbs: ["admin"] 134 | ``` 135 | 136 | Role bindings to this role can grant users access to the Ray cluster. The auth proxy can use the `SubjectAccessReview` API to verify if 137 | the authenticated user also has admin access to the targeted Ray cluster.
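A sketch of that check in the same style (Kubernetes Python client used for illustration; the function name is hypothetical):

```python
from kubernetes import client

def is_ray_cluster_admin(username: str, cluster_name: str, namespace: str) -> bool:
    """Ask the API server whether `username` has the custom `admin` verb
    on the given RayCluster resource."""
    review = client.V1SubjectAccessReview(
        spec=client.V1SubjectAccessReviewSpec(
            user=username,
            resource_attributes=client.V1ResourceAttributes(
                verb="admin",
                group="ray.io",
                resource="rayclusters",
                name=cluster_name,
                namespace=namespace,
            ),
        )
    )
    result = client.AuthorizationV1Api().create_subject_access_review(review)
    return bool(result.status.allowed)
```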
138 | 139 | Below is an example `SubjectAccessReview` request that would be sent from the auth proxy: 140 | ``` 141 | apiVersion: authorization.k8s.io/v1 142 | kind: SubjectAccessReview 143 | spec: 144 | user: userA 145 | resourceAttributes: 146 | verb: admin 147 | group: ray.io 148 | resource: rayclusters 149 | name: ray-cluster 150 | namespace: my-team 151 | ``` 152 | 153 | If the authenticated user has access to the `admin` verb for the Ray cluster resource, the `SubjectAccessReview` request returns an `allowed` status 154 | and the auth proxy grants access to the authenticated user. 155 | 156 | ## Risks 157 | 158 | Enabling the authentication proxy puts the Kubernetes control plane in the critical path for accessing the Ray head. This is a trade-off users should consider when enabling this capability. 159 | To mitigate this, we may consider implementing an in-memory cache of access control rules in the sidecar proxy so that not every request to the sidecar requires a TokenReview or SubjectAccessReview call. 160 | -------------------------------------------------------------------------------- /reps/2023-07-08-air-surface-syntax.md: -------------------------------------------------------------------------------- 1 | # Refining the Ray AIR Surface API 2 | 3 | ## Summary 4 | 5 | Disband the `ray.air` namespace and get rid of Ray AIR sessions. 6 | 7 | ### General Motivation 8 | 9 | Ray AIR has made it significantly easier to use Ray's scalable machine learning 10 | libraries 11 | - Ray Data for batch inference and last mile data processing and ingestion, 12 | - Ray Train for machine learning training and 13 | - Ray Serve for model and application serving 14 | 15 | together. 16 | 17 | One piece of feedback we have frequently received from users is that they are confused about how Ray AIR 18 | relates to the individual libraries. In particular: 19 | - When should I use AIR's abstractions (e.g. should I use `BatchPredictor` or use Ray Data's map functionality, 20 | should I use `PredictorDeployment` or deploy my model with Ray Serve directly?) and 21 | - How does the `ray.air` namespace relate to `ray.data`, `ray.train` and `ray.serve`? 22 | 23 | The fact that the `ray.air` namespace contains both low-level common utilities and higher-level 24 | abstractions adds to this confusion. We have also learned that the higher-level abstractions we 25 | originally introduced for Ray AIR have become unnecessary and the same functionality can nicely be achieved 26 | with the libraries themselves by making the libraries a little more interoperable. 27 | 28 | We have already implemented this strategy by replacing `BatchPredictor` with Ray Data native functionality 29 | (see https://github.com/ray-project/enhancements/blob/main/reps/2023-03-10-batch-inference.md and 30 | https://docs.ray.io/en/master/data/batch_inference.html) and by 31 | improving Ray Train's ingestion APIs 32 | (https://github.com/ray-project/enhancements/blob/main/reps/2023-03-15-train-api.md and 33 | https://docs.ray.io/en/master/ray-air/check-ingest.html). 34 | 35 | As a result of these changes, the `ray.air` namespace has become less and less relevant, and in this 36 | REP we propose to go all the way and remove it altogether in line with the Zen of Python 37 | ``` 38 | There should be one -- and preferably only one -- obvious way to do it.
39 | ```
40 | This resolves the confusion mentioned above and makes the Ray AIR APIs more coherent and focused around
41 | the critical workloads (`ray.data` for batch inference, `ray.train` for training and `ray.serve` for serving).
42 | 
43 | ### Should this change be within `ray` or outside?
44 | 
45 | main `ray` project. Changes are made to Ray Train, Tune and AIR.
46 | 
47 | ## Stewardship
48 | 
49 | ### Required Reviewers
50 | The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers.
51 | 
52 | @matthewdeng, @krfricke
53 | 
54 | ### Shepherd of the Proposal (should be a senior committer)
55 | To make the review process more productive, the owner of each proposal should identify a **shepherd** (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review.
56 | 
57 | @ericl
58 | 
59 | ## Details of the API changes
60 | 
61 | Concretely, we replace the Ray AIR session with a training context to
62 | 1. avoid the user confusion about what a `session` is (and avoid having to explain it in the documentation) and
63 | 2. bring the API in line with other Ray APIs like `get_runtime_context` as well as Ray Data's `DataContext`.
64 | 
65 | The API changes are
66 | ```
67 | from ray import air, train
68 | 
69 | # Ray Train methods and classes:
70 | 
71 | air.session.report -> train.report
72 | air.session.get_dataset_shard -> train.get_dataset_shard
73 | air.session.get_checkpoint -> train.get_checkpoint
74 | air.Checkpoint -> train.Checkpoint
75 | air.Result -> train.Result
76 | 
77 | # Ray Train configurations:
78 | 
79 | air.config.CheckpointConfig -> train.CheckpointConfig
80 | air.config.FailureConfig -> train.FailureConfig
81 | air.config.RunConfig -> train.RunConfig
82 | air.config.ScalingConfig -> train.ScalingConfig
83 | 
84 | # Ray TrainContext methods:
85 | 
86 | air.session.get_experiment_name -> train.get_context().get_experiment_name
87 | air.session.get_trial_name -> train.get_context().get_trial_name
88 | air.session.get_trial_id -> train.get_context().get_trial_id
89 | air.session.get_trial_resources -> train.get_context().get_trial_resources
90 | air.session.get_trial_dir -> train.get_context().get_trial_dir
91 | air.session.get_world_size -> train.get_context().get_world_size
92 | air.session.get_world_rank -> train.get_context().get_world_rank
93 | air.session.get_local_rank -> train.get_context().get_local_rank
94 | air.session.get_local_world_size -> train.get_context().get_local_world_size
95 | air.session.get_node_rank -> train.get_context().get_node_rank
96 | 
97 | del air
98 | ```
99 | 
100 | These changes are ready to try out with https://github.com/ray-project/ray/pull/36706 and we encourage user feedback on the changes.
101 | 
102 | ## Open Questions
103 | 
104 | We are likely going to remove `PredictorWrapper` and `PredictorDeployment` and migrate the examples to use Ray Serve deployments
105 | directly, and we are also likely going to move `air.integrations` to `train.integrations`.
106 | 
107 | For the `PredictorDeployment` removal, the user code will change from
108 | ```python
109 | from ray import serve
110 | from ray.serve import PredictorDeployment
111 | from ray.serve.http_adapters import pandas_read_json
112 | from ray.train.xgboost import XGBoostPredictor
113 | 
114 | # checkpoint = ...
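# (e.g., a checkpoint previously produced by an XGBoost training run, such as trainer.fit().checkpoint)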
115 | 
116 | serve.run(
117 |     PredictorDeployment.options(name="XGBoostService").bind(
118 |         XGBoostPredictor, checkpoint, http_adapter=pandas_read_json
119 |     )
120 | )
121 | ```
122 | to
123 | ```python
124 | import pandas as pd
125 | from starlette.requests import Request
126 | from ray import serve
127 | from ray.train.xgboost import XGBoostTrainer
128 | 
129 | # checkpoint = ...
130 | 
131 | @serve.deployment
132 | class XGBoostService:
133 |     def __init__(self, checkpoint):
134 |         self.model = XGBoostTrainer.get_model(checkpoint)
135 | 
136 |     async def __call__(self, http_request: Request):
137 |         body = await http_request.body()
138 |         data = pd.read_json(body.decode(), **http_request.query_params)
139 |         return self.model.predict(data)
140 | 
141 | serve.run(XGBoostService.bind(checkpoint))
142 | ```
143 | 
144 | This is almost as simple but a lot more explicit: it removes the magic and can
145 | be easily adapted to different settings. Furthermore, it is more consistent with
146 | the Ray Serve documentation and the way Ray Serve is typically used.
147 | 
148 | ## Internal changes
149 | 
150 | As part of this effort, we also recommend completely
151 | removing the `air` namespace for internal use (just to make things clearer
152 | for developers). This work does not need to be tied to a specific release,
153 | and here is an idea of where things could go:
154 | 
155 | - `air.examples` -- don't have the examples in the source tree, instead put
156 |   them into the `ray/doc` folder
157 | - `air.execution` -- due to the layering of Tune depending on Train but not
158 |   the other way around, most likely `train._internal` is the right place for these.
159 | - `air.util` -- the tensor extension functionality should go into `ray.data`,
160 |   the torch related functions into `ray.train.torch`.
161 | 
162 | If there are any other common internal utilities that are unaccounted for,
163 | most likely `train._internal` is a good place to put them.
164 | 
165 | ## Migration Plan
166 | 
167 | We acknowledge that these kinds of API changes are very taxing on our users, and we have paid special attention to ensuring that the migration can be done
168 | easily as a simple text substitution, without large changes to existing code bases. To enable a smooth migration, both APIs will
169 | work for the Ray 2.7 release.
170 | 
171 | Examples and documentation will be fully converted by Ray 2.7, and the old versions of the APIs will print deprecation warnings together
172 | with instructions on how the user code needs to be upgraded.
173 | 
--------------------------------------------------------------------------------
/reps/2023-03-15-train-api.md:
--------------------------------------------------------------------------------
1 | # Simplifying Ray Train Ingest APIs
2 | 
3 | ## Summary
4 | Deprecate the preprocessor and object store memory arguments for `ray.train.Trainer`.
5 | 
6 | ### General Motivation
7 | 
8 | This doc proposes to simplify Train's DatasetConfig as we move to the new streaming backend by default for Datasets. As noted in https://github.com/ray-project/enhancements/pull/25, Ray Datasets will have both lazy and streaming execution by default in Ray 2.4. Furthermore, `DatasetPipeline` will be deprecated in the future to consolidate functionality on the Dataset class.
9 | 
10 | With these changes, a few possibilities for simplification open up in the Train API:
11 | 1.
Decoupling Preprocessors from Trainers, so that data preprocessing is performed on the Dataset explicitly by the user prior to passing the Dataset into the Trainer.
12 | 2. Using the same resource limiting API in Train as in Datasets (i.e., `ExecutionResources`), instead of having a separate `max_object_store_memory_fraction` config.
13 | 
14 | Simplification is greatly desirable here, since users commonly find Dataset\<\>Trainer interactions difficult to understand and debug.
15 | 
16 | ### Should this change be within `ray` or outside?
17 | main `ray` project. Changes are made at the Ray Data and Ray AIR level.
18 | 
19 | ## Stewardship
20 | ### Required Reviewers
21 | The proposal will be open to the public, but please suggest a few experienced Ray contributors in this technical domain whose comments will help this proposal. Ideally, the list should include Ray committers.
22 | 
23 | @amogkam, @c21, @clarkzinzow, @jianoiax
24 | 
25 | ### Shepherd of the Proposal (should be a senior committer)
26 | To make the review process more productive, the owner of each proposal should identify a **shepherd** (should be a senior Ray committer). The shepherd is responsible for working with the owner and making sure the proposal is in good shape (with necessary information) before marking it as ready for broader review.
27 | 
28 | @stephanie-wang
29 | 
30 | ## Design and Architecture
31 | 
32 | ## API Changes
33 | 
34 | 1. Introduce a `resource_limits: ExecutionResources(object_store_memory=2 * GiB)` arg to `ray.air.DatasetConfig`. This enables streaming by default, with a limit of 2 GiB, and deprecates the previous `max_object_store_memory_fraction` argument.
35 | 
36 | 2. Introduce `Dataset.get_logical_plan()` (DeveloperAPI), which returns the logical plan that can be used to extract the lineage of preprocessors applied to this Dataset. If multiple preprocessors are applied, Train can return a `Chain` of the preprocessors. Non-preprocessor operations on the Dataset are ignored, and we can also allow ignoring preprocessors such as per-epoch preprocessors. This function will be used by Train to persist fitted preprocessors with checkpoints.
37 | 
38 | 3. Deprecate the following additional Trainer configs when streaming is enabled: `global_shuffle` and `randomize_block_order` (user to use native Dataset shuffle ops), and the preprocessing args `fit`, `transform`, `preprocessor`, and `per_epoch_preprocessor` (user to set up preprocessing explicitly prior to creating the Trainer).
39 | 
40 | ## Example:
41 | 
42 | ### Before
43 | ```python
44 | 
45 | train_ds = ray.data.read_parquet("s3://bucket/etl_output")
46 | fact_table = ray.data.read_csv("s3://bucket/my.csv")
47 | 
48 | # Create the preprocessor.
49 | prep = StandardScaler(["f1", "f2"])
50 | 
51 | # Create the per-epoch preprocessor.
52 | per_epoch_prep = RandomNoisePreprocessor()
53 | 
54 | # Trainer applies preprocessing internally via config.
55 | trainer = TorchTrainer(
56 |     model,
57 |     datasets={
58 |         "train_ds": train_ds,
59 |         "fact_table": fact_table,
60 |     },
61 |     scaling_config=ScalingConfig(num_workers=4),
62 |     preprocessor=prep,
63 |     dataset_config={
64 |         "train_ds": {
65 |             "max_object_store_memory_fraction": 0.2,  # Enable streaming.
66 |             "randomize_block_order": True,
67 |             "per_epoch_preprocessor": per_epoch_prep,
68 |         },
69 |     },
70 | )
71 | 
72 | # Checkpoint includes fitted preprocessor.
73 | best_checkpoint = trainer.fit().checkpoint
74 | assert best_checkpoint.get_preprocessor() == prep
75 | ```
76 | 
77 | ### After
78 | ```python
79 | 
80 | base = ray.data.read_parquet("s3://bucket/etl_output")
81 | fact_table = ray.data.read_csv("s3://bucket/my.csv")
82 | 
83 | # Fit the preprocessor.
84 | prep = StandardScaler(["f1", "f2"])
85 | prep.fit(base)
86 | 
87 | # Apply base preprocessing.
88 | train_ds = base.map_batches(prep)
89 | train_ds.cache()  # Optional: cache the base data in memory.
90 | 
91 | # Per-epoch preprocessing.
92 | per_epoch_prep = RandomNoisePreprocessor()
93 | per_epoch_prep.ignore_for_inference = True
94 | train_ds = train_ds \
95 |     .randomize_block_order() \
96 |     .map_batches(per_epoch_prep)
97 | 
98 | # Trainer doesn't know about preprocessing at all.
99 | trainer = TorchTrainer(
100 |     model,
101 |     datasets={
102 |         "train_ds": train_ds,
103 |         "fact_table": fact_table,
104 |     },
105 |     scaling_config=ScalingConfig(num_workers=4),
106 |     dataset_config={
107 |         "train_ds": {
108 |             "resource_limits": ExecutionResources(
109 |                 object_store_memory=20e9,  # Customized streaming memory limit.
110 |             ),
111 |         },
112 |     },
113 | )
114 | 
115 | # Checkpoint includes fitted preprocessor.
116 | best_checkpoint = trainer.fit().checkpoint
117 | assert best_checkpoint.get_preprocessor() == prep
118 | ```
119 | 
120 | While the "after" code is longer, note that all the data processing code is now cleanly separated from the Trainer, which is both a conceptual and practical simplification. In addition, having the fitted preprocessor computed early enables the user code to reference it (e.g., to get computed categories, etc.).
121 | 
122 | ## FAQ
123 | 
124 | - Q: What if I wanted to change per-trial datasets / prep with Tune?
125 | - A: You could prepare multiple datasets lazily on the driver.
126 | 
127 | - Q: Are we deprecating the preprocessor arg for Train entirely?
128 | - A: Yes.
129 | 
130 | - Q: Will we still save the preprocessor in the Checkpoint?
131 | - A: Yes, this doesn't change.
132 | 
133 | - Q: Should we have both `Preprocessor.transform` and `Dataset.map_batches`?
134 | - A: We will deprecate the former.
135 | 
136 | - Q: What happens if you apply multiple preprocessors to a Dataset?
137 | - A: The checkpoint will have the full chain, including per-epoch ones. Preprocessors can be tagged as used for training only / ignored during inference by setting an `ignore_for_inference` (constructor) attribute.
138 | 
139 | - Q: What happens if you apply ordinary functions to the Dataset?
140 | - A: You'll get a warning that these functions are not captured in the preprocessing chain, and to use BatchMapper if you want that.
141 | 
142 | - Q: Why not require the user to do all Data operations outside of Train, including the split?
143 | - A: This would break tuning, as Train needs to create a separate Data stream per trial. This is not possible post-split, as calling split is a consumption operation.
144 | 
145 | ## Compatibility, Deprecation, and Migration Plan
146 | An important part of the proposal is to explicitly point out any compatibility implications of the proposed change. If there are any, we should thoroughly discuss a plan to deprecate existing APIs and migrate to the new one(s).
147 | 
148 | Ray 2.4: Lay the groundwork for these new APIs
149 | - Streaming on by default in Datasets only (not Train).
150 | - API changes from the related inference REP https://github.com/ray-project/enhancements/pull/25
151 | 
152 | Ray 2.5: Onboard new users onto new APIs
153 | - Introduce the API changes proposed above, and enable streaming by default in Train.
154 | - Deprecated APIs will be inaccessible in streaming mode for Train.
155 | - Legacy APIs will be fully supported in non-streaming mode for Train.
156 | - Rewrite docs and examples to use new APIs.
157 | 
158 | Ray 2.6/2.7: Deprecate old APIs
159 | - Full feature parity with global / windowed shuffles using new streaming data APIs.
160 | - Fully deprecate DatasetPipeline / legacy Train APIs.
161 | 
162 | 
163 | ## Test Plan and Acceptance Criteria
164 | The proposal should discuss how the change will be tested **before** it can be merged or enabled. It should also include other acceptance criteria, including documentation and examples.
165 | 
166 | - Unit and integration tests for new APIs.
167 | - Documentation and examples on the new APIs.
168 | 
--------------------------------------------------------------------------------
/reps/2022-12-08-ray-for-federated-learning-and-privacy-preserving-computing.md:
--------------------------------------------------------------------------------
1 | 
2 | ## Summary - Ray for federated learning and privacy-preserving computing
3 | ### General Motivation
4 | Federated machine learning and privacy-preserving computing are becoming more widespread. They are inherently distributed systems because they always span multiple parties (companies or organizations). To extend the Ray ecosystem to the federated learning and privacy-preserving computing domain, we should make it easy for users to build their own federated learning and privacy-preserving computing applications on Ray, without any concerns about data privacy, task attacks, illegal intrusion, etc.
5 | 
6 | PS: I also found another related issue: https://github.com/ray-project/ray/issues/25846
7 | 
8 | In this proposal, we'd like to build a connector layer on Ray that lets users easily build this kind of system, with the security issues fully addressed.
9 | 
10 | ### Key requirements:
11 | - Provide the ability for users to easily build this kind of application on Ray.
12 | - Set up different Ray clusters for the different parties, to avoid uncontrollable, complex Ray protocols.
13 |   - We have tried setting up a single cross-party Ray cluster but failed for security reasons. It is difficult for parties to detect and prevent attacks, e.g. a malicious attacker (it could even be one of the parties) running destructive code. The complex Ray protocols make it very difficult to enhance security within a single Ray cluster.
14 | - Have a unified, globally-viewed programming model across the different parties.
15 | - Data transmission across parties should be in push mode instead of pull mode. It is hard to prevent a malicious attacker from stealing data in pull mode. Push mode is much better, since it is the sending party's responsibility to decide whether to send data to others.
16 | - Tasks should be driven in multi-controller mode. That means tasks should be driven from within their own party rather than by other parties.
17 | 
18 | ### Should this change be within ray or outside?
19 | No, there is no need to change the ray repo. The steps should be:
20 | 1. Add a new repo `RayFed` under `ray-project`: `ray-project/RayFed`
21 | 2. The package name is `rayfed`: `pip install rayfed`
22 | 3.
The import statement is `import fed`
23 | 
24 | ## Stewardship
25 | ### Required Reviewers
26 | @jovany-wang
27 | ### Shepherd of the Proposal (should be a senior committer)
28 | @jovany-wang
29 | 
30 | ## Design and Architecture
31 | The key words in this design are `multi-controller mode` and `restricted data perimeters`.
32 | 
33 | ### User Examples
34 | #### A Simple Example
35 | ```python
36 | import numpy as np
37 | import fed
38 | 
39 | @fed.remote
40 | class My:
41 |     def __init__(self):
42 |         self._val = 0
43 |     def incr(self, val):
44 |         self._val += val
45 |         return self._val
46 | 
47 | @fed.remote(party="BOB")
48 | def mean(val1, val2):
49 |     return np.mean([val1, val2])
50 | 
51 | fed.init()
52 | 
53 | my1 = My.party("ALICE").remote()
54 | my2 = My.party("BOB").remote()
55 | val1 = my1.incr.remote(10)
56 | val2 = my2.incr.remote(20)
57 | 
58 | result = mean.remote(val1, val2)
59 | fed.get(result)
60 | ```
61 | fed.shutdown()
62 | In this very simple example, we create one actor for ALICE and one for BOB, and then do an aggregation in BOB. Note that the two actors are not in the same Ray cluster.
63 | 
64 | #### A Federated Learning Example - Diagram
65 | ![image](https://user-images.githubusercontent.com/19473861/206469265-e104888e-50c8-4313-aa18-f2caf0928ef1.png)
66 | 
67 | ALICE and BOB each run their local training loop and synchronize the weights every once in a while. With this proposal, it is easy to author this code on top of Ray. 👇
68 | 
69 | #### A Federated Learning Example - Code
70 | ```python
71 | import numpy as np
72 | import tensorflow as tf
73 | from tensorflow import keras
74 | import fed
75 | 
76 | 
77 | # Two parties run local training and periodically average their weights.
78 | @fed.remote
79 | class My:
80 |     def __init__(self):
81 |         self._model = keras.Sequential()
82 |         self._model.compile(optimizer=keras.optimizers.Adam(), loss="categorical_crossentropy")
83 | 
84 |     def load_data(self, batch_size: int, epochs: int):
85 |         self._dataset = _load_data_from_local()  # user-defined batch iterator
86 | 
87 |     def train(self):
88 |         x, y = next(self._dataset)
89 |         with tf.GradientTape() as tape:
90 |             y_pred = self._model(x, training=True)
91 |             loss = self._model.compiled_loss(y, y_pred)
92 |         self._model.optimizer.apply_gradients(zip(
93 |             tape.gradient(loss, self._model.trainable_variables),
94 |             self._model.trainable_variables))
95 | 
96 |     def get_weights(self):
97 |         return self._model.get_weights()
98 | 
99 |     def update_weights(self, weights):
100 |         self._model.set_weights(weights)
101 |         return True
102 | 
103 | 
104 | @fed.remote
105 | def mean(x, y):
106 |     return np.mean([x, y], axis=0)
107 | 
108 | 
109 | def main():
110 |     fed.init()
111 |     epochs, batch_size = 100, 4
112 | 
113 |     my_alice = My.party("alice").remote()
114 |     my_bob = My.party("bob").remote()
115 | 
116 |     my_alice.load_data.remote(batch_size, epochs)
117 |     my_bob.load_data.remote(batch_size, epochs)
118 | 
119 |     num_batches = int(150 / batch_size)
120 |     for epoch in range(epochs):
121 |         # Local training loop inside each party.
122 |         for step in range(num_batches):
123 |             my_alice.train.remote()
124 |             my_bob.train.remote()
125 | 
126 |         # Do weights aggregation and updating.
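        # `w_b` below is produced in BOB's cluster but consumed by a task
        # pinned to ALICE; as described in the mechanism section below, a
        # send_op / recv_op barrier pair is inserted to push it across.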
127 |         w_a = my_alice.get_weights.remote()
128 |         w_b = my_bob.get_weights.remote()
129 |         w_mean = mean.party('alice').remote(w_a, w_b)
130 |         result = fed.get(w_mean)
131 |         print(f'Epoch {epoch} is finished, mean is {result}')
132 |         my_alice.update_weights.remote(w_mean)
133 |         my_bob.update_weights.remote(w_mean)
134 | 
135 |     fed.shutdown()
136 | ```
137 | 
138 | ### Understanding the mechanism
139 | In this section, we take a deep dive into the federated learning example above, to help understand the mechanism of this connector layer.
140 | 
141 | First, the ALICE and BOB companies need to create their own Ray clusters and expose one communication port to each other.
142 | Then, both ALICE and BOB need to run the example code in their own Ray clusters (driving the DAG in multi-controller mode). Note that the peer port and party info should be passed via `fed.init()`.
143 | ![image](https://user-images.githubusercontent.com/19473861/206469580-468ea258-407e-40cf-b059-06a63a25c168.png)
144 | 
145 | Every Ray cluster creates only the Ray tasks assigned to its own party; meanwhile, it sends output data to the peer if a downstream Ray task is assigned to another party. For example, `mean.party('alice').remote(w_a, w_b)` creates a Ray task in the ALICE cluster, but not in the BOB cluster. Because `w_b` is output data from the BOB cluster, a `recv_op` barrier is inserted before the `mean` task in the ALICE cluster, and a `send_op` barrier is inserted in the BOB cluster as a downstream task of the `w_b = my_bob.get_weights.remote()` task.
146 | 
147 | ### Benefits
148 | Compared with running the DAG in one Ray cluster, we get the following significant benefits:
149 | - It is very close to the native Ray programming pattern, but the DAG can run across Ray clusters.
150 | - Very restricted and clear data perimeters: we only transmit the data that is needed by others, and none of the data can be fetched in any way we don't allow.
151 | - If we ran the DAG in one Ray cluster, data transmission would be PULL-BASED. For example, if BOB invokes `ray.get(f.options("PARTY_ALICE").remote())`, the return object is pulled by BOB; as a result, ALICE has no knowledge of that transfer. In this proposal, transmission is PUSH-BASED: ALICE knows that a data object will be sent to BOB. This is a significant advantage of multi-controller mode.
152 | - All the disadvantages are addressed in this proposal: **code distribution**, **data privacy**, **task attacks**, **illegal intrusion**, **deserialization vulnerabilities**, etc.
153 | - It brings all of Ray's advantages to the local Ray cluster, like high-performance RPC, fault tolerance, task scheduling/resource management, and other Ray ecosystem libraries (Ray Datasets, Ray Train and Ray Serve) for local training.
154 | 
155 | 
156 | 
157 | ## Compatibility, Deprecation, and Migration Plan
158 | None.
159 | 
160 | ## Test Plan and Acceptance Criteria
161 | - Tests should include unit tests and integration tests (release tests).
162 | - Tests should be enabled in CI.
163 | - Benchmark cross-party data transmission for typical federated learning cases.
164 | - Document pages and necessary code comments.
165 | 
166 | ## (Optional) Follow-on Work
167 | - UX improvements for cross-party exception passing.
168 | - Provide a single-controller mode for quickly building applications and debugging within a single party.
169 | - Performance optimization for other workloads.
170 | - A unified gateway for ETH communication.
171 | - Support TEE (Trusted Execution Environment) devices.
--------------------------------------------------------------------------------
/reps/2025-11-21-ray-history-server/2025-11-21-ray-history-server.md:
--------------------------------------------------------------------------------
1 | ## Ray History Server
2 | 
3 | ### General Motivation
4 | 
5 | It is becoming increasingly common for Ray users to treat Ray clusters as ephemeral units of compute
6 | that only run for the duration of a single (or multiple) Ray jobs. This pattern results in significant improvements in
7 | cost efficiency and resource sharing, especially in capacity-constrained environments where hardware accelerators are scarce.
8 | 
9 | However, a fundamental trade-off of this approach, compared to long-lived interactive Ray clusters, is that users
10 | lose access to the Ray Dashboard, which is often treated as the entry point for most observability signals
11 | in a Ray cluster. While it’s possible to export the relevant signals to external sources, users prefer the experience
12 | of using the Ray Dashboard as a single source of truth to debug job failures.
13 | 
14 | This enhancement proposal introduces the concept of a Ray History Server, which can orchestrate the reconstruction of the
15 | Ray Dashboard even for terminated Ray clusters. This will be accomplished by leveraging Ray’s Event Export API to persist
16 | task/actor/job state, and native components in KubeRay for pushing logs and events to a blob store (GCS, S3, etc.).
17 | 
18 | ### Should this change be within `ray` or outside?
19 | 
20 | Some components/libraries, such as the event exporter, will be in Ray. Everything else will be hosted in the KubeRay project.
21 | 
22 | ## Stewardship
23 | 
24 | ### Owners
25 | 
26 | - @KunWuLuan (Alibaba)
27 | - @MengjinYan, @Future-Outlier, @rueian (Anyscale)
28 | - @andrewsykim, @Chia-ya (Google)
29 | 
30 | ### Shepherd of the Proposal (should be a senior committer)
31 | 
32 | @edoakes
33 | 
34 | ## Design and Architecture
35 | 
36 | ### Components and Libraries
37 | 
38 | The Ray History Server project will introduce the following components and libraries across Ray and KubeRay:
39 | 
40 | * Ray:
41 |   * Updated Ray Dashboard frontend that can dynamically adjust request paths to fetch task/actor/job state from a history server.
42 |   * Ray Event Export API, available starting in Ray 2.49, which supports task/actor/job/node events.
43 |     The events can be used to generate the state of the Ray cluster at any given timestamp.
44 | * KubeRay:
45 |   * An events collector sidecar that receives push events from Ray (via `RAY_enable_core_worker_ray_event_to_aggregator`)
46 |     and persists events to the blob store.
47 |   * A logging collector sidecar that uploads Ray logs to the blob store.
48 |   * A history server (standalone Deployment) that can process events for historical Ray clusters and
49 |     serve Ray API endpoints requested by Ray Dashboards.
50 |   * A storage reader/writer library providing a pluggable interface for different storage implementations.
51 | 
52 | ![Ray History Server Architecture](history_server_architecture.png)
53 | 
54 | #### Events Collector
55 | 
56 | The Events Collector is a sidecar container deployed alongside every Ray node. It operates as an
57 | HTTP server ingesting events from Ray’s Event Export API (enabled via `RAY_enable_core_worker_ray_event_to_aggregator`).
58 | The server exposes a single `POST /events` endpoint, which receives event data as JSON objects.
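As a rough illustration of this contract, a minimal collector endpoint could look like the sketch below. This is not the actual KubeRay implementation; the port, file path, and JSON-lines layout are assumptions, with a local file standing in for the blob store writer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

EVENTS_FILE = "/tmp/collected_events.jsonl"  # assumption: stand-in for the blob store


class EventsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/events":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        json.loads(payload)  # validate that the body is well-formed JSON
        # Persist the raw events as-is: no pre-processing, no deduplication.
        with open(EVENTS_FILE, "ab") as f:
            f.write(payload + b"\n")
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("localhost", 8080), EventsHandler).serve_forever()
```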
59 | Ray is configured to push these events to a localhost endpoint (configured with `RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR`).
60 | The Events Collector is strictly responsible for persisting raw events to blob storage; it does not perform any pre-processing or deduplication.
61 | 
62 | #### Logging Collector
63 | 
64 | The Logging Collector is responsible for persisting the logs in `/tmp/ray/session_latest/logs` to the blob store.
65 | While the Ray cluster is active, the Logging Collector will periodically upload snapshots of the logs to storage.
66 | Upon receiving a termination signal, it will attempt to upload a final snapshot of the logs before exiting.
67 | 
68 | #### History Server
69 | 
70 | The History Server component is a stateless Kubernetes Deployment that serves API endpoints that are compatible with the Ray Dashboard.
71 | To serve endpoints like `/api/v0/tasks`, the History Server will be responsible for server-side processing of the events
72 | in the blob store that were uploaded by the Events Collector. In the alpha / beta milestones, the history server will store
73 | final task / actor / job states in memory. For GA, we may reconsider this approach if we identify scalability limitations.
74 | More details on event processing below.
75 | 
76 | #### Event Processor
77 | 
78 | The Event Processor runs as a process within the History Server container. It is responsible for downloading the
79 | complete event history of terminated clusters and aggregating that data into final states. These processed states
80 | are then used to serve API requests from Ray Dashboard clients.
81 | 
82 | ### File structure for persisted events & logs
83 | 
84 | Users rarely filter events by node name; instead, they typically filter by job ID and time range.
85 | Therefore, building an index based on job ID and timestamps is critical. Unlike the Spark History Server,
86 | Ray events are emitted by an aggregation agent residing on each node; therefore, the collector on each specific node
87 | is responsible for grouping the events.
88 | 
89 | All events will initially be partitioned by Job ID. Specifically, task events associated with the same Job ID will be stored in the same directory.
90 | * Node-level events will be stored in: cluster_name_cluster_uid/session_id/node_events/-