├── .gitignore
├── ADBicon.jpg
├── Cost Management PBI config.png
├── Cost Management PBI config2.jpg
├── Cost Management PBI report 1.png
├── Cost Management PBI report2.PNG
├── Cost Management PBI report3.PNG
├── Cost Management PBI report4.PNG
├── Cost Management Report.png
├── Cost Management Reservation.png
├── Cost Management config.png
├── Cost Management connector.png
├── Cost Management export.png
├── Cost Management filter.png
├── Cost Management.png
├── Figure1.PNG
├── Figure2.PNG
├── Figure3.PNG
├── Figure4.PNG
├── Figure5.PNG
├── Figure6.PNG
├── Figure7.PNG
├── Figure8.png
├── Grafana.PNG
├── LICENSE
├── LICENSE-CODE
├── PerfSnippet.PNG
├── Python Snippet.PNG
├── README.md
├── SECURITY.md
├── SparkSnippet.PNG
├── Table 2.PNG
├── Table1.PNG
├── Table2.PNG
├── Table3.PNG
└── toc.md
/.gitignore:
--------------------------------------------------------------------------------
1 | ## Ignore Visual Studio temporary files, build results, and
2 | ## files generated by popular Visual Studio add-ons.
3 | ##
4 | ## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore
5 |
6 | # User-specific files
7 | *.suo
8 | *.user
9 | *.userosscache
10 | *.sln.docstates
11 |
12 | # User-specific files (MonoDevelop/Xamarin Studio)
13 | *.userprefs
14 |
15 | # Build results
16 | [Dd]ebug/
17 | [Dd]ebugPublic/
18 | [Rr]elease/
19 | [Rr]eleases/
20 | x64/
21 | x86/
22 | bld/
23 | [Bb]in/
24 | [Oo]bj/
25 | [Ll]og/
26 |
27 | # Visual Studio 2015/2017 cache/options directory
28 | .vs/
29 | # Uncomment if you have tasks that create the project's static files in wwwroot
30 | #wwwroot/
31 |
32 | # Visual Studio 2017 auto generated files
33 | Generated\ Files/
34 |
35 | # MSTest test Results
36 | [Tt]est[Rr]esult*/
37 | [Bb]uild[Ll]og.*
38 |
39 | # NUNIT
40 | *.VisualState.xml
41 | TestResult.xml
42 |
43 | # Build Results of an ATL Project
44 | [Dd]ebugPS/
45 | [Rr]eleasePS/
46 | dlldata.c
47 |
48 | # Benchmark Results
49 | BenchmarkDotNet.Artifacts/
50 |
51 | # .NET Core
52 | project.lock.json
53 | project.fragment.lock.json
54 | artifacts/
55 | **/Properties/launchSettings.json
56 |
57 | # StyleCop
58 | StyleCopReport.xml
59 |
60 | # Files built by Visual Studio
61 | *_i.c
62 | *_p.c
63 | *_i.h
64 | *.ilk
65 | *.meta
66 | *.obj
67 | *.iobj
68 | *.pch
69 | *.pdb
70 | *.ipdb
71 | *.pgc
72 | *.pgd
73 | *.rsp
74 | *.sbr
75 | *.tlb
76 | *.tli
77 | *.tlh
78 | *.tmp
79 | *.tmp_proj
80 | *.log
81 | *.vspscc
82 | *.vssscc
83 | .builds
84 | *.pidb
85 | *.svclog
86 | *.scc
87 |
88 | # Chutzpah Test files
89 | _Chutzpah*
90 |
91 | # Visual C++ cache files
92 | ipch/
93 | *.aps
94 | *.ncb
95 | *.opendb
96 | *.opensdf
97 | *.sdf
98 | *.cachefile
99 | *.VC.db
100 | *.VC.VC.opendb
101 |
102 | # Visual Studio profiler
103 | *.psess
104 | *.vsp
105 | *.vspx
106 | *.sap
107 |
108 | # Visual Studio Trace Files
109 | *.e2e
110 |
111 | # TFS 2012 Local Workspace
112 | $tf/
113 |
114 | # Guidance Automation Toolkit
115 | *.gpState
116 |
117 | # ReSharper is a .NET coding add-in
118 | _ReSharper*/
119 | *.[Rr]e[Ss]harper
120 | *.DotSettings.user
121 |
122 | # JustCode is a .NET coding add-in
123 | .JustCode
124 |
125 | # TeamCity is a build add-in
126 | _TeamCity*
127 |
128 | # DotCover is a Code Coverage Tool
129 | *.dotCover
130 |
131 | # AxoCover is a Code Coverage Tool
132 | .axoCover/*
133 | !.axoCover/settings.json
134 |
135 | # Visual Studio code coverage results
136 | *.coverage
137 | *.coveragexml
138 |
139 | # NCrunch
140 | _NCrunch_*
141 | .*crunch*.local.xml
142 | nCrunchTemp_*
143 |
144 | # MightyMoose
145 | *.mm.*
146 | AutoTest.Net/
147 |
148 | # Web workbench (sass)
149 | .sass-cache/
150 |
151 | # Installshield output folder
152 | [Ee]xpress/
153 |
154 | # DocProject is a documentation generator add-in
155 | DocProject/buildhelp/
156 | DocProject/Help/*.HxT
157 | DocProject/Help/*.HxC
158 | DocProject/Help/*.hhc
159 | DocProject/Help/*.hhk
160 | DocProject/Help/*.hhp
161 | DocProject/Help/Html2
162 | DocProject/Help/html
163 |
164 | # Click-Once directory
165 | publish/
166 |
167 | # Publish Web Output
168 | *.[Pp]ublish.xml
169 | *.azurePubxml
170 | # Note: Comment the next line if you want to checkin your web deploy settings,
171 | # but database connection strings (with potential passwords) will be unencrypted
172 | *.pubxml
173 | *.publishproj
174 |
175 | # Microsoft Azure Web App publish settings. Comment the next line if you want to
176 | # checkin your Azure Web App publish settings, but sensitive information contained
177 | # in these scripts will be unencrypted
178 | PublishScripts/
179 |
180 | # NuGet Packages
181 | *.nupkg
182 | # The packages folder can be ignored because of Package Restore
183 | **/[Pp]ackages/*
184 | # except build/, which is used as an MSBuild target.
185 | !**/[Pp]ackages/build/
186 | # Uncomment if necessary however generally it will be regenerated when needed
187 | #!**/[Pp]ackages/repositories.config
188 | # NuGet v3's project.json files produces more ignorable files
189 | *.nuget.props
190 | *.nuget.targets
191 |
192 | # Microsoft Azure Build Output
193 | csx/
194 | *.build.csdef
195 |
196 | # Microsoft Azure Emulator
197 | ecf/
198 | rcf/
199 |
200 | # Windows Store app package directories and files
201 | AppPackages/
202 | BundleArtifacts/
203 | Package.StoreAssociation.xml
204 | _pkginfo.txt
205 | *.appx
206 |
207 | # Visual Studio cache files
208 | # files ending in .cache can be ignored
209 | *.[Cc]ache
210 | # but keep track of directories ending in .cache
211 | !*.[Cc]ache/
212 |
213 | # Others
214 | ClientBin/
215 | ~$*
216 | *~
217 | *.dbmdl
218 | *.dbproj.schemaview
219 | *.jfm
220 | *.pfx
221 | *.publishsettings
222 | orleans.codegen.cs
223 |
224 | # Including strong name files can present a security risk
225 | # (https://github.com/github/gitignore/pull/2483#issue-259490424)
226 | #*.snk
227 |
228 | # Since there are multiple workflows, uncomment next line to ignore bower_components
229 | # (https://github.com/github/gitignore/pull/1529#issuecomment-104372622)
230 | #bower_components/
231 |
232 | # RIA/Silverlight projects
233 | Generated_Code/
234 |
235 | # Backup & report files from converting an old project file
236 | # to a newer Visual Studio version. Backup files are not needed,
237 | # because we have git ;-)
238 | _UpgradeReport_Files/
239 | Backup*/
240 | UpgradeLog*.XML
241 | UpgradeLog*.htm
242 | ServiceFabricBackup/
243 | *.rptproj.bak
244 |
245 | # SQL Server files
246 | *.mdf
247 | *.ldf
248 | *.ndf
249 |
250 | # Business Intelligence projects
251 | *.rdl.data
252 | *.bim.layout
253 | *.bim_*.settings
254 | *.rptproj.rsuser
255 |
256 | # Microsoft Fakes
257 | FakesAssemblies/
258 |
259 | # GhostDoc plugin setting file
260 | *.GhostDoc.xml
261 |
262 | # Node.js Tools for Visual Studio
263 | .ntvs_analysis.dat
264 | node_modules/
265 |
266 | # Visual Studio 6 build log
267 | *.plg
268 |
269 | # Visual Studio 6 workspace options file
270 | *.opt
271 |
272 | # Visual Studio 6 auto-generated workspace file (contains which files were open etc.)
273 | *.vbw
274 |
275 | # Visual Studio LightSwitch build output
276 | **/*.HTMLClient/GeneratedArtifacts
277 | **/*.DesktopClient/GeneratedArtifacts
278 | **/*.DesktopClient/ModelManifest.xml
279 | **/*.Server/GeneratedArtifacts
280 | **/*.Server/ModelManifest.xml
281 | _Pvt_Extensions
282 |
283 | # Paket dependency manager
284 | .paket/paket.exe
285 | paket-files/
286 |
287 | # FAKE - F# Make
288 | .fake/
289 |
290 | # JetBrains Rider
291 | .idea/
292 | *.sln.iml
293 |
294 | # CodeRush
295 | .cr/
296 |
297 | # Python Tools for Visual Studio (PTVS)
298 | __pycache__/
299 | *.pyc
300 |
301 | # Cake - Uncomment if you are using it
302 | # tools/**
303 | # !tools/packages.config
304 |
305 | # Tabs Studio
306 | *.tss
307 |
308 | # Telerik's JustMock configuration file
309 | *.jmconfig
310 |
311 | # BizTalk build output
312 | *.btp.cs
313 | *.btm.cs
314 | *.odx.cs
315 | *.xsd.cs
316 |
317 | # OpenCover UI analysis results
318 | OpenCover/
319 |
320 | # Azure Stream Analytics local run output
321 | ASALocalRun/
322 |
323 | # MSBuild Binary and Structured Log
324 | *.binlog
325 |
326 | # NVidia Nsight GPU debugger configuration file
327 | *.nvuser
328 |
329 | # MFractors (Xamarin productivity tool) working folder
330 | .mfractor/
331 |
--------------------------------------------------------------------------------
/ADBicon.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/ADBicon.jpg
--------------------------------------------------------------------------------
/Cost Management PBI config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI config.png
--------------------------------------------------------------------------------
/Cost Management PBI config2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI config2.jpg
--------------------------------------------------------------------------------
/Cost Management PBI report 1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report 1.png
--------------------------------------------------------------------------------
/Cost Management PBI report2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report2.PNG
--------------------------------------------------------------------------------
/Cost Management PBI report3.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report3.PNG
--------------------------------------------------------------------------------
/Cost Management PBI report4.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report4.PNG
--------------------------------------------------------------------------------
/Cost Management Report.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management Report.png
--------------------------------------------------------------------------------
/Cost Management Reservation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management Reservation.png
--------------------------------------------------------------------------------
/Cost Management config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management config.png
--------------------------------------------------------------------------------
/Cost Management connector.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management connector.png
--------------------------------------------------------------------------------
/Cost Management export.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management export.png
--------------------------------------------------------------------------------
/Cost Management filter.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management filter.png
--------------------------------------------------------------------------------
/Cost Management.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management.png
--------------------------------------------------------------------------------
/Figure1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure1.PNG
--------------------------------------------------------------------------------
/Figure2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure2.PNG
--------------------------------------------------------------------------------
/Figure3.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure3.PNG
--------------------------------------------------------------------------------
/Figure4.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure4.PNG
--------------------------------------------------------------------------------
/Figure5.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure5.PNG
--------------------------------------------------------------------------------
/Figure6.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure6.PNG
--------------------------------------------------------------------------------
/Figure7.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure7.PNG
--------------------------------------------------------------------------------
/Figure8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure8.png
--------------------------------------------------------------------------------
/Grafana.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Grafana.PNG
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Attribution 4.0 International
2 |
3 | =======================================================================
4 |
5 | Creative Commons Corporation ("Creative Commons") is not a law firm and
6 | does not provide legal services or legal advice. Distribution of
7 | Creative Commons public licenses does not create a lawyer-client or
8 | other relationship. Creative Commons makes its licenses and related
9 | information available on an "as-is" basis. Creative Commons gives no
10 | warranties regarding its licenses, any material licensed under their
11 | terms and conditions, or any related information. Creative Commons
12 | disclaims all liability for damages resulting from their use to the
13 | fullest extent possible.
14 |
15 | Using Creative Commons Public Licenses
16 |
17 | Creative Commons public licenses provide a standard set of terms and
18 | conditions that creators and other rights holders may use to share
19 | original works of authorship and other material subject to copyright
20 | and certain other rights specified in the public license below. The
21 | following considerations are for informational purposes only, are not
22 | exhaustive, and do not form part of our licenses.
23 |
24 | Considerations for licensors: Our public licenses are
25 | intended for use by those authorized to give the public
26 | permission to use material in ways otherwise restricted by
27 | copyright and certain other rights. Our licenses are
28 | irrevocable. Licensors should read and understand the terms
29 | and conditions of the license they choose before applying it.
30 | Licensors should also secure all rights necessary before
31 | applying our licenses so that the public can reuse the
32 | material as expected. Licensors should clearly mark any
33 | material not subject to the license. This includes other CC-
34 | licensed material, or material used under an exception or
35 | limitation to copyright. More considerations for licensors:
36 | wiki.creativecommons.org/Considerations_for_licensors
37 |
38 | Considerations for the public: By using one of our public
39 | licenses, a licensor grants the public permission to use the
40 | licensed material under specified terms and conditions. If
41 | the licensor's permission is not necessary for any reason--for
42 | example, because of any applicable exception or limitation to
43 | copyright--then that use is not regulated by the license. Our
44 | licenses grant only permissions under copyright and certain
45 | other rights that a licensor has authority to grant. Use of
46 | the licensed material may still be restricted for other
47 | reasons, including because others have copyright or other
48 | rights in the material. A licensor may make special requests,
49 | such as asking that all changes be marked or described.
50 | Although not required by our licenses, you are encouraged to
51 | respect those requests where reasonable. More_considerations
52 | for the public:
53 | wiki.creativecommons.org/Considerations_for_licensees
54 |
55 | =======================================================================
56 |
57 | Creative Commons Attribution 4.0 International Public License
58 |
59 | By exercising the Licensed Rights (defined below), You accept and agree
60 | to be bound by the terms and conditions of this Creative Commons
61 | Attribution 4.0 International Public License ("Public License"). To the
62 | extent this Public License may be interpreted as a contract, You are
63 | granted the Licensed Rights in consideration of Your acceptance of
64 | these terms and conditions, and the Licensor grants You such rights in
65 | consideration of benefits the Licensor receives from making the
66 | Licensed Material available under these terms and conditions.
67 |
68 |
69 | Section 1 -- Definitions.
70 |
71 | a. Adapted Material means material subject to Copyright and Similar
72 | Rights that is derived from or based upon the Licensed Material
73 | and in which the Licensed Material is translated, altered,
74 | arranged, transformed, or otherwise modified in a manner requiring
75 | permission under the Copyright and Similar Rights held by the
76 | Licensor. For purposes of this Public License, where the Licensed
77 | Material is a musical work, performance, or sound recording,
78 | Adapted Material is always produced where the Licensed Material is
79 | synched in timed relation with a moving image.
80 |
81 | b. Adapter's License means the license You apply to Your Copyright
82 | and Similar Rights in Your contributions to Adapted Material in
83 | accordance with the terms and conditions of this Public License.
84 |
85 | c. Copyright and Similar Rights means copyright and/or similar rights
86 | closely related to copyright including, without limitation,
87 | performance, broadcast, sound recording, and Sui Generis Database
88 | Rights, without regard to how the rights are labeled or
89 | categorized. For purposes of this Public License, the rights
90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar
91 | Rights.
92 |
93 | d. Effective Technological Measures means those measures that, in the
94 | absence of proper authority, may not be circumvented under laws
95 | fulfilling obligations under Article 11 of the WIPO Copyright
96 | Treaty adopted on December 20, 1996, and/or similar international
97 | agreements.
98 |
99 | e. Exceptions and Limitations means fair use, fair dealing, and/or
100 | any other exception or limitation to Copyright and Similar Rights
101 | that applies to Your use of the Licensed Material.
102 |
103 | f. Licensed Material means the artistic or literary work, database,
104 | or other material to which the Licensor applied this Public
105 | License.
106 |
107 | g. Licensed Rights means the rights granted to You subject to the
108 | terms and conditions of this Public License, which are limited to
109 | all Copyright and Similar Rights that apply to Your use of the
110 | Licensed Material and that the Licensor has authority to license.
111 |
112 | h. Licensor means the individual(s) or entity(ies) granting rights
113 | under this Public License.
114 |
115 | i. Share means to provide material to the public by any means or
116 | process that requires permission under the Licensed Rights, such
117 | as reproduction, public display, public performance, distribution,
118 | dissemination, communication, or importation, and to make material
119 | available to the public including in ways that members of the
120 | public may access the material from a place and at a time
121 | individually chosen by them.
122 |
123 | j. Sui Generis Database Rights means rights other than copyright
124 | resulting from Directive 96/9/EC of the European Parliament and of
125 | the Council of 11 March 1996 on the legal protection of databases,
126 | as amended and/or succeeded, as well as other essentially
127 | equivalent rights anywhere in the world.
128 |
129 | k. You means the individual or entity exercising the Licensed Rights
130 | under this Public License. Your has a corresponding meaning.
131 |
132 |
133 | Section 2 -- Scope.
134 |
135 | a. License grant.
136 |
137 | 1. Subject to the terms and conditions of this Public License,
138 | the Licensor hereby grants You a worldwide, royalty-free,
139 | non-sublicensable, non-exclusive, irrevocable license to
140 | exercise the Licensed Rights in the Licensed Material to:
141 |
142 | a. reproduce and Share the Licensed Material, in whole or
143 | in part; and
144 |
145 | b. produce, reproduce, and Share Adapted Material.
146 |
147 | 2. Exceptions and Limitations. For the avoidance of doubt, where
148 | Exceptions and Limitations apply to Your use, this Public
149 | License does not apply, and You do not need to comply with
150 | its terms and conditions.
151 |
152 | 3. Term. The term of this Public License is specified in Section
153 | 6(a).
154 |
155 | 4. Media and formats; technical modifications allowed. The
156 | Licensor authorizes You to exercise the Licensed Rights in
157 | all media and formats whether now known or hereafter created,
158 | and to make technical modifications necessary to do so. The
159 | Licensor waives and/or agrees not to assert any right or
160 | authority to forbid You from making technical modifications
161 | necessary to exercise the Licensed Rights, including
162 | technical modifications necessary to circumvent Effective
163 | Technological Measures. For purposes of this Public License,
164 | simply making modifications authorized by this Section 2(a)
165 | (4) never produces Adapted Material.
166 |
167 | 5. Downstream recipients.
168 |
169 | a. Offer from the Licensor -- Licensed Material. Every
170 | recipient of the Licensed Material automatically
171 | receives an offer from the Licensor to exercise the
172 | Licensed Rights under the terms and conditions of this
173 | Public License.
174 |
175 | b. No downstream restrictions. You may not offer or impose
176 | any additional or different terms or conditions on, or
177 | apply any Effective Technological Measures to, the
178 | Licensed Material if doing so restricts exercise of the
179 | Licensed Rights by any recipient of the Licensed
180 | Material.
181 |
182 | 6. No endorsement. Nothing in this Public License constitutes or
183 | may be construed as permission to assert or imply that You
184 | are, or that Your use of the Licensed Material is, connected
185 | with, or sponsored, endorsed, or granted official status by,
186 | the Licensor or others designated to receive attribution as
187 | provided in Section 3(a)(1)(A)(i).
188 |
189 | b. Other rights.
190 |
191 | 1. Moral rights, such as the right of integrity, are not
192 | licensed under this Public License, nor are publicity,
193 | privacy, and/or other similar personality rights; however, to
194 | the extent possible, the Licensor waives and/or agrees not to
195 | assert any such rights held by the Licensor to the limited
196 | extent necessary to allow You to exercise the Licensed
197 | Rights, but not otherwise.
198 |
199 | 2. Patent and trademark rights are not licensed under this
200 | Public License.
201 |
202 | 3. To the extent possible, the Licensor waives any right to
203 | collect royalties from You for the exercise of the Licensed
204 | Rights, whether directly or through a collecting society
205 | under any voluntary or waivable statutory or compulsory
206 | licensing scheme. In all other cases the Licensor expressly
207 | reserves any right to collect such royalties.
208 |
209 |
210 | Section 3 -- License Conditions.
211 |
212 | Your exercise of the Licensed Rights is expressly made subject to the
213 | following conditions.
214 |
215 | a. Attribution.
216 |
217 | 1. If You Share the Licensed Material (including in modified
218 | form), You must:
219 |
220 | a. retain the following if it is supplied by the Licensor
221 | with the Licensed Material:
222 |
223 | i. identification of the creator(s) of the Licensed
224 | Material and any others designated to receive
225 | attribution, in any reasonable manner requested by
226 | the Licensor (including by pseudonym if
227 | designated);
228 |
229 | ii. a copyright notice;
230 |
231 | iii. a notice that refers to this Public License;
232 |
233 | iv. a notice that refers to the disclaimer of
234 | warranties;
235 |
236 | v. a URI or hyperlink to the Licensed Material to the
237 | extent reasonably practicable;
238 |
239 | b. indicate if You modified the Licensed Material and
240 | retain an indication of any previous modifications; and
241 |
242 | c. indicate the Licensed Material is licensed under this
243 | Public License, and include the text of, or the URI or
244 | hyperlink to, this Public License.
245 |
246 | 2. You may satisfy the conditions in Section 3(a)(1) in any
247 | reasonable manner based on the medium, means, and context in
248 | which You Share the Licensed Material. For example, it may be
249 | reasonable to satisfy the conditions by providing a URI or
250 | hyperlink to a resource that includes the required
251 | information.
252 |
253 | 3. If requested by the Licensor, You must remove any of the
254 | information required by Section 3(a)(1)(A) to the extent
255 | reasonably practicable.
256 |
257 | 4. If You Share Adapted Material You produce, the Adapter's
258 | License You apply must not prevent recipients of the Adapted
259 | Material from complying with this Public License.
260 |
261 |
262 | Section 4 -- Sui Generis Database Rights.
263 |
264 | Where the Licensed Rights include Sui Generis Database Rights that
265 | apply to Your use of the Licensed Material:
266 |
267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right
268 | to extract, reuse, reproduce, and Share all or a substantial
269 | portion of the contents of the database;
270 |
271 | b. if You include all or a substantial portion of the database
272 | contents in a database in which You have Sui Generis Database
273 | Rights, then the database in which You have Sui Generis Database
274 | Rights (but not its individual contents) is Adapted Material; and
275 |
276 | c. You must comply with the conditions in Section 3(a) if You Share
277 | all or a substantial portion of the contents of the database.
278 |
279 | For the avoidance of doubt, this Section 4 supplements and does not
280 | replace Your obligations under this Public License where the Licensed
281 | Rights include other Copyright and Similar Rights.
282 |
283 |
284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
285 |
286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
296 |
297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
304 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
306 |
307 | c. The disclaimer of warranties and limitation of liability provided
308 | above shall be interpreted in a manner that, to the extent
309 | possible, most closely approximates an absolute disclaimer and
310 | waiver of all liability.
311 |
312 |
313 | Section 6 -- Term and Termination.
314 |
315 | a. This Public License applies for the term of the Copyright and
316 | Similar Rights licensed here. However, if You fail to comply with
317 | this Public License, then Your rights under this Public License
318 | terminate automatically.
319 |
320 | b. Where Your right to use the Licensed Material has terminated under
321 | Section 6(a), it reinstates:
322 |
323 | 1. automatically as of the date the violation is cured, provided
324 | it is cured within 30 days of Your discovery of the
325 | violation; or
326 |
327 | 2. upon express reinstatement by the Licensor.
328 |
329 | For the avoidance of doubt, this Section 6(b) does not affect any
330 | right the Licensor may have to seek remedies for Your violations
331 | of this Public License.
332 |
333 | c. For the avoidance of doubt, the Licensor may also offer the
334 | Licensed Material under separate terms or conditions or stop
335 | distributing the Licensed Material at any time; however, doing so
336 | will not terminate this Public License.
337 |
338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
339 | License.
340 |
341 |
342 | Section 7 -- Other Terms and Conditions.
343 |
344 | a. The Licensor shall not be bound by any additional or different
345 | terms or conditions communicated by You unless expressly agreed.
346 |
347 | b. Any arrangements, understandings, or agreements regarding the
348 | Licensed Material not stated herein are separate from and
349 | independent of the terms and conditions of this Public License.
350 |
351 |
352 | Section 8 -- Interpretation.
353 |
354 | a. For the avoidance of doubt, this Public License does not, and
355 | shall not be interpreted to, reduce, limit, restrict, or impose
356 | conditions on any use of the Licensed Material that could lawfully
357 | be made without permission under this Public License.
358 |
359 | b. To the extent possible, if any provision of this Public License is
360 | deemed unenforceable, it shall be automatically reformed to the
361 | minimum extent necessary to make it enforceable. If the provision
362 | cannot be reformed, it shall be severed from this Public License
363 | without affecting the enforceability of the remaining terms and
364 | conditions.
365 |
366 | c. No term or condition of this Public License will be waived and no
367 | failure to comply consented to unless expressly agreed to by the
368 | Licensor.
369 |
370 | d. Nothing in this Public License constitutes or may be interpreted
371 | as a limitation upon, or waiver of, any privileges and immunities
372 | that apply to the Licensor or You, including from the legal
373 | processes of any jurisdiction or authority.
374 |
375 |
376 | =======================================================================
377 |
378 | Creative Commons is not a party to its public
379 | licenses. Notwithstanding, Creative Commons may elect to apply one of
380 | its public licenses to material it publishes and in those instances
381 | will be considered the “Licensor.” The text of the Creative Commons
382 | public licenses is dedicated to the public domain under the CC0 Public
383 | Domain Dedication. Except for the limited purpose of indicating that
384 | material is shared under a Creative Commons public license or as
385 | otherwise permitted by the Creative Commons policies published at
386 | creativecommons.org/policies, Creative Commons does not authorize the
387 | use of the trademark "Creative Commons" or any other trademark or logo
388 | of Creative Commons without its prior written consent including,
389 | without limitation, in connection with any unauthorized modifications
390 | to any of its public licenses or any other arrangements,
391 | understandings, or agreements concerning use of licensed material. For
392 | the avoidance of doubt, this paragraph does not form part of the
393 | public licenses.
394 |
395 | Creative Commons may be contacted at creativecommons.org.
--------------------------------------------------------------------------------
/LICENSE-CODE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | Copyright (c) Microsoft Corporation
3 |
4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
5 | associated documentation files (the "Software"), to deal in the Software without restriction,
6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense,
7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so,
8 | subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all copies or substantial
11 | portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/PerfSnippet.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/PerfSnippet.PNG
--------------------------------------------------------------------------------
/Python Snippet.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Python Snippet.PNG
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Azure Databricks Best Practices
2 |
3 | 
4 |
5 |
6 | Authors:
7 | Dhruv Kumar, Senior Solutions Architect, Databricks
8 | Premal Shah, Azure Databricks PM, Microsoft
9 | Bhanu Prakash, Azure Databricks PM, Microsoft
10 |
11 | Written by: Priya Aswani, WW Data Engineering & AI Technical Lead
12 |
13 |
14 | Published: June 22, 2019
15 |
16 |
17 | Version: 1.0
18 |
19 |
20 | [Click here for the Best Practices](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/toc.md)
21 |
22 | Disclaimers:
23 |
24 | This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.
25 |
26 | Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.
27 |
28 | This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.
29 |
30 | © 2019 Microsoft. All rights reserved.
31 |
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 | # Contributing
43 |
44 | This project welcomes contributions and suggestions. Most contributions require you to agree to a
45 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
46 | the rights to use your contribution. For details, visit https://cla.microsoft.com.
47 |
48 | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
49 | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
50 | provided by the bot. You will only need to do this once across all repos using our CLA.
51 |
52 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
53 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
54 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
55 |
56 | # Legal Notices
57 |
58 | Microsoft and any contributors grant you a license to the Microsoft documentation and other content
59 | in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode),
60 | see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the
61 | [LICENSE-CODE](LICENSE-CODE) file.
62 |
63 | Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation
64 | may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries.
65 | The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks.
66 | Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
67 |
68 | Privacy information can be found at https://privacy.microsoft.com/en-us/
69 |
70 | Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents,
71 | or trademarks, whether by implication, estoppel or otherwise.
72 |
--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ## Security
4 |
5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
6 |
7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below.
8 |
9 | ## Reporting Security Issues
10 |
11 | **Please do not report security vulnerabilities through public GitHub issues.**
12 |
13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report).
14 |
15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey).
16 |
17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc).
18 |
19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
20 |
21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
22 | * Full paths of source file(s) related to the manifestation of the issue
23 | * The location of the affected source code (tag/branch/commit or direct URL)
24 | * Any special configuration required to reproduce the issue
25 | * Step-by-step instructions to reproduce the issue
26 | * Proof-of-concept or exploit code (if possible)
27 | * Impact of the issue, including how an attacker might exploit the issue
28 |
29 | This information will help us triage your report more quickly.
30 |
31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs.
32 |
33 | ## Preferred Languages
34 |
35 | We prefer all communications to be in English.
36 |
37 | ## Policy
38 |
39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd).
40 |
41 |
42 |
--------------------------------------------------------------------------------
/SparkSnippet.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/SparkSnippet.PNG
--------------------------------------------------------------------------------
/Table 2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table 2.PNG
--------------------------------------------------------------------------------
/Table1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table1.PNG
--------------------------------------------------------------------------------
/Table2.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table2.PNG
--------------------------------------------------------------------------------
/Table3.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table3.PNG
--------------------------------------------------------------------------------
/toc.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 | # Azure Databricks Best Practices
10 |
11 | Authors:
12 | * Dhruv Kumar, Senior Solutions Architect, Databricks
13 | * Premal Shah, Azure Databricks PM, Microsoft
14 | * Bhanu Prakash, Azure Databricks PM, Microsoft
15 |
16 | Written by: Priya Aswani, WW Data Engineering & AI Technical Lead
17 |
18 | # Table of Contents
19 |
20 | - [Introduction](#Introduction)
21 | - [Scalable ADB Deployments: Guidelines for Networking, Security, and Capacity Planning](#scalable-ADB-Deployments-Guidelines-for-Networking-Security-and-Capacity-Planning)
22 | * [Azure Databricks 101](#Azure-Databricks-101)
23 |   * [Map Workspaces to Business Divisions](#Map-Workspaces-to-Business-Divisions)
24 | * [Deploy Workspaces in Multiple Subscriptions to Honor Azure Capacity Limits](#Deploy-Workspaces-in-Multiple-Subscriptions-to-Honor-Azure-Capacity-Limits)
25 | + [ADB Workspace Limits](#ADB-Workspace-Limits)
26 | + [Azure Subscription Limits](#Azure-Subscription-Limits)
27 | * [Consider Isolating Each Workspace in its own VNet](#Consider-Isolating-Each-Workspace-in-its-own-VNet)
28 | * [Select the Largest Vnet CIDR](#Select-the-Largest-Vnet-CIDR)
29 | * [Azure Databricks Deployment with limited private IP addresses](#Azure-Databricks-Deployment-with-limited-private-IP-addresses)
30 | * [Do not Store any Production Data in Default DBFS Folders](#Do-not-Store-any-Production-Data-in-Default-DBFS-Folders)
31 | * [Always Hide Secrets in a Key Vault](#Always-Hide-Secrets-in-a-Key-Vault)
32 | - [Deploying Applications on ADB: Guidelines for Selecting, Sizing, and Optimizing Clusters Performance](#Deploying-Applications-on-ADB-Guidelines-for-Selecting-Sizing-and-Optimizing-Clusters-Performance)
33 | * [Support Interactive analytics using Shared High Concurrency Clusters](#support-interactive-analytics-using-shared-high-concurrency-clusters)
34 | * [Support Batch ETL Workloads with Single User Ephemeral Standard Clusters](#support-batch-etl-workloads-with-single-user-ephemeral-standard-clusters)
35 | * [Favor Cluster Scoped Init scripts over Global and Named scripts](#favor-cluster-scoped-init-scripts-over-global-and-named-scripts)
36 | * [Use Cluster Log Delivery Feature to Manage Logs](#Use-Cluster-Log-Delivery-Feature-to-Manage-Logs)
37 | * [Choose VMs to Match Workload](#Choose-VMs-to-Match-Workload)
38 | * [Arrive at Correct Cluster Size by Iterative Performance Testing](#Arrive-at-correct-cluster-size-by-iterative-performance-testing)
39 | * [Tune Shuffle for Optimal Performance](#Tune-shuffle-for-optimal-performance)
40 | * [Partition Your Data](#partition-your-data)
41 | - [Running ADB Applications Smoothly: Guidelines on Observability and Monitoring](#Running-ADB-Applications-Smoothly-Guidelines-on-Observability-and-Monitoring)
42 | * [Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace](#Collect-resource-utilization-metrics-across-Azure-Databricks-cluster-in-a-Log-Analytics-workspace)
43 | + [Querying VM metrics in Log Analytics once you have started the collection using the above document](#Querying-VM-metrics-in-log-analytics-once-you-have-started-the-collection-using-the-above-document)
44 | - [Cost Management, Chargeback and Analysis](#Cost-Management-Chargeback-and-Analysis)
45 | - [Appendix A](#Appendix-A)
46 | * [Installation for being able to capture VM metrics in Log Analytics](#Installation-for-being-able-to-capture-VM-metrics-in-Log-Analytics)
47 | + [Overview](#Overview)
48 | + [Step 1 - Create a Log Analytics Workspace](#step-1---create-a-log-analytics-workspace)
49 | + [Step 2 - Get Log Analytics Workspace Credentials](#step-2--get-log-analytics-workspace-credentials)
50 | + [Step 3 - Configure Data Collection in Log Analytics Workspace](#step-3---configure-data-collection-in-log-analytics-workspace)
51 | + [Step 4 - Configure the Init Script](#Step-4---Configure-the-Init-script)
52 | + [Step 5 - View Collected Data via Azure Portal](#Step-5---View-Collected-Data-via-Azure-Portal)
53 | + [References](#References)
54 | * [Access patterns with Azure Data Lake Storage Gen2](#Access-patterns-with-Azure-Data-Lake-Storage-Gen2)
55 |
56 |
57 |
58 |
66 |
67 |
68 |
69 |
70 | > ***"A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away."***
71 | Antoine de Saint-Exupéry
72 |
73 |
74 |
75 | ## Introduction
76 |
77 | Planning, deploying, and running Azure Databricks (ADB) at scale requires one to make many architectural decisions.
78 |
79 | While each ADB deployment is unique to an organization's needs, we have found that some patterns are common across most successful ADB projects. Unsurprisingly, these patterns are also in line with modern Cloud-centric development best practices.
80 |
81 | This short guide summarizes these patterns into prescriptive and actionable best practices for Azure Databricks. We follow a logical path of planning the infrastructure, provisioning the workspaces, developing Azure Databricks applications, and finally, running Azure Databricks in production.
82 |
83 | The audience of this guide is system architects, field engineers, and development teams of customers, Microsoft, and Databricks. Since the Azure Databricks product goes through fast iteration cycles, we have avoided recommendations based on roadmap or Private Preview features.
84 |
85 | Our recommendations should apply to a typical Fortune 500 enterprise with at least an intermediate level of Azure and Databricks knowledge. We've also classified each recommendation according to its likely impact on a solution's quality attributes. Using the **Impact** factor, you can weigh the recommendation against other competing choices. Example: if the impact is classified as “Very High”, the implications of not adopting the best practice can have a significant impact on your deployment.
86 |
87 | **Important Note**: This guide is intended to be used with the detailed [Azure Databricks Documentation](https://docs.azuredatabricks.net/index.html).
88 |
89 | ## Scalable ADB Deployments: Guidelines for Networking, Security, and Capacity Planning
90 |
91 | Azure Databricks (ADB) deployments for very small organizations, PoC applications, or personal education hardly require any planning. You can spin up a Workspace using the Azure Portal in a matter of minutes, create a Notebook, and start writing code.
92 |
93 | Enterprise-grade, large-scale deployments are a different story altogether. Some upfront planning is necessary to manage Azure Databricks deployments across large teams. In particular, you need to understand:
94 |
95 | * Networking requirements of Databricks
96 | * The number and the type of Azure networking resources required to launch clusters
97 | * Relationship between Azure and Databricks jargon: Subscription, VNet, Workspaces, Clusters, Subnets, etc.
98 | * Overall Capacity Planning process: where to begin, what to consider?
99 |
100 |
101 | Let’s start with a short Azure Databricks 101 and then discuss some best practices for scalable and secure deployments.
102 |
103 | ## Azure Databricks 101
104 |
105 | ADB is a Big Data analytics service. Being a Cloud Optimized managed [PaaS](https://azure.microsoft.com/en-us/overview/what-is-paas/) offering, it is designed to hide the underlying distributed systems and networking complexity as much as possible from the end user. It is backed by a team of support staff who monitor its health, debug tickets filed via Azure, etc. This allows ADB users to focus on developing value-generating apps rather than stressing over infrastructure management.
106 |
107 | You can deploy ADB using Azure Portal or using [ARM templates](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-overview#template-deployment). One successful ADB deployment produces exactly one Workspace, a space where users can log in and author analytics apps. It comprises the file browser, notebooks, tables, clusters, [DBFS](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html#dbfs) storage, etc. More importantly, Workspace is a fundamental isolation unit in Databricks. All workspaces are completely isolated from each other.
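For repeatable deployments, the same can be scripted. Below is a minimal sketch (not an official procedure), assuming the `azure-identity` and `azure-mgmt-databricks` Python packages; the subscription, resource group, workspace name, and region are hypothetical placeholders.

```python
# Minimal sketch: create an ADB workspace programmatically (hypothetical names throughout).
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

subscription_id = "<subscription-id>"  # placeholder
client = AzureDatabricksManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.workspaces.begin_create_or_update(
    resource_group_name="rg-adb-demo",      # hypothetical resource group
    workspace_name="adb-bu1-dev",           # hypothetical workspace name
    parameters={
        "location": "eastus2",
        "sku": {"name": "premium"},
        # ADB keeps the resources it manages (VNet, storage, ...) in a separate, managed resource group
        "managed_resource_group_id": f"/subscriptions/{subscription_id}/resourceGroups/rg-adb-demo-managed",
    },
)
workspace = poller.result()                 # blocks until the deployment completes
print(workspace.workspace_url)              # the per-workspace URL discussed below
```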
108 |
109 | Each workspace is identified by a globally unique 53-bit number, called ***Workspace ID or Organization ID***. The URL that a customer sees after logging in always uniquely identifies the workspace they are using.
110 |
111 | *https://adb-workspaceId.azuredatabricks.net/?o=workspaceId*
112 |
113 | Example: *https://adb-12345.eastus2.azuredatabricks.net/?o=12345*
114 |
115 | Azure Databricks uses [Azure Active Directory (AAD)](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis) as the exclusive Identity Provider, and there’s a seamless out-of-the-box integration between them. This makes ADB tightly integrated with Azure, just like its other core services. Any AAD member assigned to the Owner or Contributor role can deploy Databricks and is automatically added to the ADB members list upon first login. If a user is not a member or guest of the Active Directory tenant, they can’t log in to the workspace.
116 | Granting access to a user in another tenant (for example, if contoso.com wants to collaborate with adventure-works.com users) does work because those external users are added as guests to the tenant hosting Azure Databricks.
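Because AAD is the identity provider, an AAD access token can also be used to call the workspace's REST API directly. A minimal sketch, assuming the `azure-identity` and `requests` packages and a hypothetical workspace URL (`2ff814a6-3304-4ab8-85cb-cd0e6f879c1d` is the well-known AAD resource ID for Azure Databricks):

```python
# Minimal sketch: call a workspace REST API with an AAD token (hypothetical workspace URL).
import requests
from azure.identity import DefaultAzureCredential

ADB_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"  # Azure Databricks AAD resource

token = DefaultAzureCredential().get_token(f"{ADB_RESOURCE_ID}/.default").token
resp = requests.get(
    "https://adb-1234567890123456.7.azuredatabricks.net/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
print(resp.json())  # clusters visible to the logged-in AAD identity
```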
117 |
118 | Azure Databricks comes with its own user management interface. You can create users and groups in a workspace, assign them certain privileges, etc. While users in AAD are equivalent to Databricks users, by default AAD roles have no relationship with groups created inside ADB, unless you use [SCIM](https://docs.azuredatabricks.net/administration-guide/admin-settings/scim/aad.html) for provisioning users and groups. With SCIM, you can import both groups and users from AAD into Azure Databricks, and the synchronization is automatic after the initial import. ADB also has a special group called ***Admins***, not to be confused with AAD’s role Admin.
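For illustration, the SCIM endpoints can be queried to see which users and groups currently exist in a workspace. A minimal sketch, assuming the `requests` package, a personal access token, and a hypothetical workspace URL:

```python
# Minimal sketch: list workspace users and groups via the SCIM API (hypothetical host/token).
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

users = requests.get(f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/Users", headers=HEADERS).json()
groups = requests.get(f"{DATABRICKS_HOST}/api/2.0/preview/scim/v2/Groups", headers=HEADERS).json()

for user in users.get("Resources", []):
    print("user:", user["userName"])
for group in groups.get("Resources", []):
    print("group:", group["displayName"])
```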
119 |
120 | The first user to log in and initialize the workspace is the workspace ***owner***, and they are automatically assigned to the Databricks admin group. This person can invite other users to the workspace, add them as admins, create groups, etc. The identity of the logged-in ADB user is provided by AAD and shows up under the user menu in the Workspace:
121 |
122 |
123 |
124 |
125 | *Figure 1: Databricks user menu*
126 |
127 |
128 | Multiple clusters can exist within a workspace: there’s a one-to-many mapping from a Subscription to Workspaces and, further, from a Workspace to multiple Clusters.
129 |
130 | 
131 |
132 | *Figure 2: Relationship Between AAD, Workspace, Resource Groups, and Clusters*
133 |
134 | With this basic understanding, let’s discuss how to plan a typical ADB deployment. We first grapple with the issue of how to divide workspaces and assign them to users and teams.
135 |
136 |
137 | ## Map Workspaces to Business Divisions
138 | *Impact: Very High*
139 |
140 | How many workspaces do you need to deploy? The answer depends largely on your organization’s structure. We recommend that you assign workspaces to related groups of people who work together collaboratively. This helps streamline your access control matrix within a workspace (folders, notebooks, etc.) and across all the resources the workspace interacts with (storage, related data stores like Azure SQL DB, Azure SQL DW, etc.). This type of division scheme is also known as the [Business Unit Subscription](https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/decision-guides/subscriptions/) design pattern, and it aligns well with the Databricks chargeback model.
141 |
142 |
143 |
144 |
145 |
146 |
147 | *Figure 3: Business Unit Subscription Design Pattern*
148 |
149 | ## Deploy Workspaces in Multiple Subscriptions to Honor Azure Capacity Limits
150 | *Impact: Very High*
151 |
152 | Customers commonly partition workspaces based on teams or departments and arrive at that division naturally. But it is also important to partition with Azure Subscription and ADB Workspace limits in mind.
153 |
154 | ### Databricks Workspace Limits
155 | Azure Databricks is a multitenant service and to provide fair resource sharing to all regional customers, it imposes limits on API calls. These limits are expressed at the Workspace level and are due to internal ADB components. For instance, you can only run up to 1000 concurrent jobs in a workspace. Beyond that, ADB will deny your job submissions. There are also other limits such as max hourly job submissions, max notebooks, etc.
156 |
157 | Key workspace limits are:
158 |
159 | * The maximum number of jobs that a workspace can create in an hour is **5000**
160 | * At any time, you cannot have more than **1000 jobs** simultaneously running in a workspace
161 | * There can be a maximum of **145 notebooks** attached to a cluster
162 |
163 | ### Azure Subscription Limits
164 | Next, there are [Azure limits](https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits) to consider since ADB deployments are built on top of the Azure infrastructure.
165 |
166 | For more help in understanding the impact of these limits or options of increasing them, please contact Microsoft or Databricks technical architects.
167 |
168 | > ***We highly recommend separating workspaces into production and dev, and deploying them into separate subscriptions.***
169 |
170 | ## Consider Isolating Each Workspace in its own VNet
171 | *Impact: Low*
172 |
173 | While you can deploy more than one Workspace in a VNet by keeping the associated subnet pairs separate from other workspaces, we recommend deploying only one workspace in any VNet. Doing so aligns exactly with ADB's Workspace-level isolation model. Organizations most often consider putting multiple workspaces in the same VNet so that they can all share common networking resources, such as DNS, placed in the same VNet, because the private address space in a VNet is shared by all resources. You can achieve the same result while keeping the Workspaces separate by following the [hub and spoke model](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/hybrid-networking/hub-spoke) and using VNet Peering to extend the private IP space of the workspace VNet. Here are the steps:
174 | 1. Deploy each Workspace in its own spoke VNet.
175 | 2. Put all the common networking resources, such as your custom DNS server, in a central hub VNet.
176 | 3. Join the Workspace spokes with the central networking hub using [VNet Peering](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-peering.html).
177 |
178 | More information: [Azure Virtual Datacenter: a network perspective](https://docs.microsoft.com/en-us/azure/architecture/vdc/networking-virtual-datacenter#topology)
179 |
180 |
181 |
182 |
183 |
184 | *Figure 4: Hub and Spoke Model*
185 |
186 | ## Select the Largest VNet CIDR
187 | *Impact: Very High*
188 |
189 | > ***This recommendation only applies if you're using the Bring Your Own VNet feature.***
190 |
191 | Recall that each Workspace can have multiple clusters. The total capacity of clusters in each workspace is a function of the masks used for the workspace's enclosing VNet and the pair of subnets associated with each cluster in the workspace. The masks can be changed if you use the [Bring Your Own VNet](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html#vnet-inject) feature, as it gives you more control over the networking layout. It is important to understand this relationship for accurate capacity planning.
192 |
193 | * Each cluster node requires 1 Public IP and 2 Private IPs
194 | * These IPs are logically grouped into 2 subnets named “public” and “private”
195 | * For a desired cluster size of X: number of Public IPs = X, number of Private IPs = 2X
196 | * The size of the private and public subnets thus determines the total number of VMs available for clusters
197 |   + A /22 subnet is larger than a /23, so setting the private and public subnets to /22 makes more VMs available for creating clusters than, say, /23 or smaller subnets
198 | * But, because of the address space allocation scheme, the size of the private and public subnets is constrained by the VNet's CIDR
199 | * The allowed values for the enclosing VNet CIDR are /16 through /24
200 | * The private and public subnet masks must be:
201 |   + Equal to each other
202 |   + No smaller than /26 (that is, the mask number can be at most 26)
203 |
204 | With this information, we can quickly arrive at the table below, which shows how many nodes one can use across all clusters for a given VNet CIDR. It is clear that the selection of VNet CIDR has far-reaching implications for maximum cluster size.
205 |
206 | | Enclosing Vnet CIDR’s Mask where ADB Workspace is deployed | Allowed Masks on Private and Public Subnets (should be equal) | Max number of nodes across all clusters in the Workspace, assuming higher subnet mask is chosen |
207 | |:------------------------------------------------------------:|:---------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------:|
208 | | /16 | /17 through /26 | 32766 |
209 | | /17 | /18 through /26 | 16382 |
210 | | /18 | /19 through /26 | 8190 |
211 | | /19 | /20 through /26 | 4094 |
212 | | /20 | /21 through /26 | 2046 |
213 | | /21 | /22 through /26 | 1022 |
214 | | /22 | /23 through /26 | 510 |
215 | | /23 | /24 through /26 | 254 |
216 | | /24 | /25 or /26 | 126 |
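
The same figures can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes the layout described above: the private and public subnets each take half of the VNet (subnet mask = VNet mask + 1), and two addresses per subnet are not usable for cluster nodes.

```python
# Illustrative sketch only -- not an official sizing tool.
def max_cluster_nodes(vnet_mask: int) -> int:
    subnet_mask = vnet_mask + 1          # largest equal private/public subnets that fit the VNet
    return 2 ** (32 - subnet_mask) - 2   # usable addresses per subnet

for mask in range(16, 25):
    print(f"/{mask} VNet -> up to {max_cluster_nodes(mask)} nodes across all clusters")
```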
217 |
218 |
219 |
220 | ## Azure Databricks Deployment with Limited Private IP Addresses
221 | *Impact: High*
222 |
223 | Depending on where data sources are located, Azure Databricks can be deployed in a connected or disconnected scenario. In a connected scenario, Azure Databricks must be able to reach data sources located in Azure VNets or on-premises locations directly. In a disconnected scenario, data can be copied to a storage platform (such as an Azure Data Lake Storage account) to which Azure Databricks can be connected using mount points.
224 | ***This section covers how to deploy Azure Databricks when private IP addresses are limited and Azure Databricks can be configured to access data using mount points (the disconnected scenario).***
225 |
226 | Many multi-national enterprise organizations are building platforms in Azure based on the hub and spoke network architecture, a model that maps directly to the recommended Azure Databricks deployment of one workspace per VNet. Workspaces are deployed on the spokes, while shared networking and security resources such as ExpressRoute connectivity or DNS infrastructure are deployed in the hub.
227 | Customers who have exhausted (or are close to exhausting) their RFC1918 IP address ranges have to optimize address space for spoke VNets. They may only be able to provide small VNets in most cases (/25 or smaller) and only in exceptional cases a larger VNet (such as a /24).
228 |
229 | As the smallest Azure Databricks deployment requires a /24 VNet, such customers need an alternative solution so that the business can deploy one or multiple Azure Databricks clusters across multiple VNets as required, while still being able to create larger clusters, which would require a larger VNet address space.
230 |
231 | A recommended Azure Databricks implementation, which ensures minimal RFC1918 addresses are used while still allowing business users to deploy as many Azure Databricks clusters as they want, as small or as large as they need, consists of the following environments within the same Azure subscription, as depicted in the figure below:
232 |
233 |
234 |
235 |
236 |
237 |
238 |
239 | *Figure 8: Network Topology*
240 |
241 |
242 | As the diagram depicts, the business application subscription where Azure Databricks will be deployed has two VNets. One is routable to on-premises and the rest of the Azure environment (this can be a small VNet such as a /26) and includes the following Azure data resources: Azure Data Factory and ADLS Gen2 (via Private Endpoint).
243 | > ***Note: While we use Azure Data Factory on this implementation, any other service that can perform similar functionality could be used.***
244 |
245 |
246 |
247 | The other VNet is fully disconnected and not routable to the rest of the environment. On this VNet, Databricks and optionally Azure Bastion (to perform management via jumpboxes) are deployed, as well as a Private Endpoint to the ADLS Gen2 storage so that Databricks can retrieve data for ingestion. This setup is described in further detail below:
248 |
249 |
250 |
251 | **Connected (routable environment)**
252 | * In a business application subscription, deploy a VNet with RFC1918 addresses which is fully routable in Azure and cross-premises via ExpressRoute. This VNet can be a small VNet, such as /26 or /27.
253 | * This VNet is connected to a central hub VNet via VNet peering to provide connectivity across Azure and on-premises via ExpressRoute or VPN.
254 | * UDR with default route (0.0.0.0/0) points to a central NVA (for example, Azure Firewall) for internet outbound traffic.
255 | * NSGs are configured to block inbound traffic from the internet.
256 | * Azure Data Lake (ADLS) Gen2 is deployed in the business application subscription.
257 | * A Private Endpoint is created on the VNet to make ADLS Gen 2 storage accessible from on-premises and from Azure VNets via a private IP address.
258 | * Azure Data Factory will be responsible for the process of moving data from the source locations (other spoke VNets or on-premises) into the ADLS Gen2 store (accessible via Private Endpoint).
259 | * Azure Data Factory (ADF) is deployed on this routable VNet
260 | * Azure Data Factory components require a compute infrastructure to run on, referred to as the Integration Runtime. In this scenario, moving data from on-premises data sources to Azure data services (accessible via Private Endpoint) requires a Self-Hosted Integration Runtime.
261 | * The Self-Hosted Integration Runtime needs to be installed on an Azure Virtual Machine inside the routable VNet to allow Azure Data Factory to communicate with the source and destination data.
262 | * Considering this, Azure Data Factory only requires 1 IP address (and at most 4 IP addresses) in the VNet (via the integration runtime).
263 |
264 |
265 |
266 | **Disconnected (non-routable environment)**
267 | * In the same business application subscription, deploy a VNet with any RFC1918 address space that is desired by the application team (for example, 10.0.0.0/16)
268 | * This VNet is not going to be connected to the rest of the environment. In other words, this will be a disconnected and fully isolated VNet.
269 | * This VNet includes 3 required and 3 optional subnets:
270 | * 2x of them dedicated exclusively to the Azure Databricks Workspace (private-subnet and public-subnet)
271 | * 1x which will be used for the private link to the ADLS Gen2
272 | * (Optional) 1x for Azure Bastion
273 | * (Optional) 1x for jumpboxes
274 | * (Optional but recommended) 1x for Azure Firewall (or other network security NVA).
275 | * Azure Databricks is deployed on this disconnected VNet.
276 | * Azure Bastion is deployed on this disconnected VNet, to allow Azure Databricks administration via jumpboxes.
277 | * Azure Firewall (or another network security NVA) is deployed on this disconnected VNet to secure internet outbound traffic.
278 | * NSGs are used to lockdown traffic across subnets.
279 | * 2x Private Endpoints are created on this disconnected VNet to make the ADLS Gen2 storage accessible for the Databricks cluster:
280 | * 1x private endpoint having the target sub-resource *blob*
281 | * 1x private endpoint having the target sub-resource *dfs*
282 | * Databricks integrates with the ADLS Gen2 storage for data ingestion (see the mount sketch below)
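
For illustration, a minimal sketch of mounting the ADLS Gen2 account over its private endpoint using a service principal is shown below. The secret scope, service principal details, and account/container names are placeholders that you would replace with your own.

```python
# Sketch: mount the ADLS Gen2 store so Databricks can read the data staged by Azure Data Factory.
# <application-id>, <tenant-id>, <container>, <storage-account>, and the secret scope/key are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="akv-scope", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# The storage traffic resolves to the private endpoint inside the disconnected VNet.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/ingest",
    extra_configs=configs,
)
```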
283 |
284 |
285 |
286 | ## Do not Store any Production Data in Default DBFS Folders
287 | *Impact: High*
288 |
289 | This recommendation is driven by security and data availability concerns. Every Workspace comes with a default DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You should not store any production data in it, because:
290 | 1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contents.
291 | 2. One can't restrict access to this default folder and its contents.
292 |
293 | > ***This recommendation doesn't apply to Blob or ADLS folders explicitly mounted as DBFS by the end user***
294 |
295 | **More Information:**
296 | [Databricks File System](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html)
297 |
298 |
299 | ## Always Hide Secrets in a Key Vault
300 | *Impact: High*
301 |
302 | It is a significant security risk to expose sensitive data such as access credentials openly in Notebooks or other places such as job configs, init scripts, etc. You should always use a vault to securely store and access them.
303 | You can either use ADB's internal, Databricks-backed secret store for this purpose or use the Azure Key Vault (AKV) service.
304 |
305 | If using Azure Key Vault, create separate AKV-backed secret scopes and corresponding AKVs to store credentials pertaining to different data stores. This will help prevent users from accessing credentials that they might not have access to. Since access controls are applicable to the entire secret scope, users with access to the scope will see all secrets for the AKV associated with that scope.
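
As a minimal sketch (the scope, key, and storage account names below are hypothetical), a notebook reads a credential from a secret scope rather than hard-coding it:

```python
# Read a storage credential from an Azure Key Vault-backed secret scope
storage_key = dbutils.secrets.get(scope="akv-datastore-scope", key="adls-account-key")

# Use the secret, e.g. to configure access to a storage account (placeholder account name)
spark.conf.set("fs.azure.account.key.mystorageaccount.dfs.core.windows.net", storage_key)
```

Values fetched this way are redacted in notebook output, which helps keep them out of results and logs.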
306 |
307 | **More Information:**
308 |
309 | [Create an Azure Key Vault-backed secret scope](https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html)
310 |
311 | [Example of using secret in a notebook](https://docs.azuredatabricks.net/user-guide/secrets/example-secret-workflow.html)
312 |
313 | [Best practices for creating secret scopes](https://docs.azuredatabricks.net/user-guide/secrets/secret-acl.html)
314 |
315 | # Deploying Applications on ADB: Guidelines for Selecting, Sizing, and Optimizing Clusters Performance
316 |
317 | > ***"Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure."***
318 | Melvin Conway
319 |
320 |
321 | Having covered how to provision workspaces, best practices in networking, etc., let's put on the developer's hat and look at the design choices developers typically face:
322 |
323 | * What type of clusters should I use?
324 | * How many drivers and how many workers?
325 | * Which Azure VMs should I select?
326 |
327 | In this chapter we will address such concerns and provide our recommendations, while also explaining the internals of Databricks clusters and associated topics. Some of these ideas seem counterintuitive but they will all make sense if you keep these important design attributes of the ADB service in mind:
328 |
329 | 1. **Cloud Optimized:** Azure Databricks is a product built exclusively for cloud environments, like Azure. No on-prem deployments currently exist. It assumes certain features are provided by the Cloud, is designed with Cloud best practices in mind, and, conversely, provides Cloud-friendly features.
330 | 2. **Platform/Software as a Service Abstraction:** ADB sits somewhere between the PaaS and SaaS ends of the spectrum, depending on how you use it. In either case ADB is designed to hide infrastructure details as much as possible so the user can focus on application development. It is
331 | not, for example, an IaaS offering exposing the guts of the OS Kernel to you.
332 | 3. **Managed Service:** ADB guarantees a 99.95% uptime SLA. There's a large team of dedicated staff members who monitor various aspects of its health and get alerted when something goes wrong. It is run like an always-on website, and the Microsoft and Databricks system operations teams strive to minimize any downtime.
333 |
334 | These three attributes make ADB very different from other Spark platforms such as HDP, CDH, Mesos, etc., which are designed for on-prem datacenters and allow the user complete control over the hardware. The concept of a cluster is therefore fairly unique in Azure Databricks. Unlike YARN or Mesos clusters, which are just a collection of worker machines waiting for an application to be scheduled on them, clusters in ADB come with a pre-configured Spark application. ADB submits all subsequent user requests,
335 | like notebook commands, SQL queries, Java jar jobs, etc., to this primordial app for execution.
336 |
337 | Under the covers Databricks clusters use the lightweight Spark Standalone resource allocator.
338 |
339 |
340 | When it comes to taxonomy, ADB clusters are divided along the notions of “type”, and “mode.” There are two ***types*** of ADB clusters, according to how they are created. Clusters created using UI and [Clusters API](https://docs.azuredatabricks.net/api/latest/clusters.html) are called Interactive Clusters, whereas those created using [Jobs API](https://docs.azuredatabricks.net/api/latest/jobs.html) are called Jobs Clusters. Further, each cluster can be of two ***modes***: Standard and High Concurrency. Regardless of types or mode, all clusters in Azure Databricks can automatically scale to match the workload, using a feature known as [Autoscaling](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling).
341 |
342 |
343 | *Table 2: Cluster modes and their characteristics*
344 |
345 | 
346 |
347 | ## Support Interactive Analytics Using Shared High Concurrency Clusters
348 | *Impact: Medium*
349 |
350 | There are three steps for supporting Interactive workloads on ADB:
351 | 1. Deploy a shared cluster instead of letting each user create their own cluster.
352 | 2. Create the shared cluster in High Concurrency mode instead of Standard mode.
353 | 3. Configure security on the shared High Concurrency cluster, using **one** of the following options:
354 | * Turn on [AAD Credential Passthrough](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/credential-passthrough.html#enabling-azure-ad-credential-passthrough-to-adls) if you’re using ADLS
355 | * Turn on Table Access Control for all other stores
356 |
357 | To understand why, let’s quickly see how interactive workloads are different from batch workloads:
358 |
359 | *Table 3: Batch vs. Interactive workloads*
360 | 
361 |
362 | Because of these differences, supporting Interactive workloads entails minimizing cost variability and optimizing for latency over throughput, while providing a secure environment. These goals are satisfied by shared High Concurrency clusters with Table access controls or AAD Passthrough turned on (in case of ADLS):
363 |
364 | 1. **Minimizing Cost:** By forcing users to share an autoscaling cluster you have configured with a maximum node count, rather than, say, asking them to create a new one each time they log in, you can control the total cost easily. The maximum cost of a shared cluster can be calculated by assuming it runs X hours at maximum size with the particular VMs. It is difficult to achieve this if each user is given free rein to create clusters of arbitrary size and VM types.
365 |
366 | 2. **Optimizing for Latency:** Only High Concurrency clusters have features that allow queries from different users to share cluster resources in a fair, secure manner. HC clusters come with Query Watchdog, a process which keeps disruptive queries in check by automatically preempting rogue queries, limiting the maximum size of output rows returned, etc.
367 |
368 | 3. **Security:** The Table Access Control feature is only available in High Concurrency mode and needs to be turned on so that users can limit access to their database objects (tables, views, functions, etc.) created on the shared cluster. In the case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature instead of Table Access Control.
369 |
370 | > ***If you’re using ADLS, we recommend AAD Credential Passthrough instead of Table Access Control for easy manageability.***
371 |
372 | 
373 |
374 | *Figure 5: Interactive clusters*
375 |
376 | ## Support Batch ETL Workloads With Single User Ephemeral Standard Clusters
377 | *Impact: Medium*
378 |
379 | Unlike Interactive workloads, the logic in batch Jobs is well defined and their cluster resource requirements are known *a priori*. Hence, to minimize cost, there's no reason to follow the shared cluster model, and we
380 | recommend letting each job create a separate cluster for its execution. Thus, instead of submitting batch ETL jobs to a cluster already created from ADB's UI, submit them using the Jobs APIs. These APIs automatically create new clusters to run Jobs and terminate them after the job completes. We call this the **Ephemeral Job Cluster** pattern for running jobs because the cluster's short life is tied to the job lifecycle.
381 |
382 | Azure Data Factory uses this pattern as well - each job ends up creating a separate cluster since the underlying call is made using the [Runs-Submit Jobs API](https://docs.azuredatabricks.net/api/latest/jobs.html#runs-submit).
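
A minimal sketch of such a submission is shown below. The workspace URL, access token, VM size, runtime version, and notebook path are placeholders; the call targets the Runs-Submit endpoint referenced above.

```python
import requests

host = "https://adb-<workspaceId>.<region>.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"                                  # ideally read from a secret store

payload = {
    "run_name": "nightly-etl",
    "new_cluster": {                      # an ephemeral cluster created just for this run
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 8,
    },
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
}

resp = requests.post(f"{host}/api/2.0/jobs/runs/submit",
                     headers={"Authorization": f"Bearer {token}"},
                     json=payload)
resp.raise_for_status()
print(resp.json())   # contains the run_id; the cluster terminates when the run finishes
```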
383 |
384 |
385 | 
386 |
387 | *Figure 6: Ephemeral Job cluster*
388 |
389 | Just like the previous recommendation, this pattern will achieve general goals of minimizing cost, improving the target metric (throughput), and enhancing security by:
390 |
391 | 1. **Enhanced Security:** ephemeral clusters run only one job at a time, so each executor’s JVM runs code from only one user. This makes ephemeral clusters more secure than shared clusters for Java and Scala code.
392 | 2. **Lower Cost:** if you run jobs on a cluster created from ADB’s UI, you will be charged at the higher Interactive DBU rate. The lower Data Engineering DBU rate is only available when the lifecycles of the job and the cluster are the same. This is only achievable by using the Jobs APIs to launch jobs on ephemeral clusters.
393 | 3. **Better Throughput:** cluster’s resources are dedicated to one job only, making the job finish faster than while running in a shared environment.
394 |
395 | For very short duration jobs (< 10 min), the cluster launch time (~7 min) adds a significant overhead to the total execution time. Historically this forced users to run short jobs on existing clusters created from the UI -- a
396 | costlier and less secure alternative. To fix this, ADB is coming out with a new feature called Instance Pools in Q3 2019, bringing cluster launch time down to 30 seconds or less.
397 |
398 | ## Favor Cluster Scoped Init Scripts over Global and Named scripts
399 | *Impact: High*
400 |
401 | [Init Scripts](https://docs.azuredatabricks.net/user-guide/clusters/init-scripts.html) provide a way to configure cluster’s nodes and can be used in the following modes:
402 |
403 | 1. **Global:** by placing the init script in the `/databricks/init` folder, you force the script’s execution every time any cluster is created or restarted by users of the workspace.
404 | 2. **Cluster Named (deprecated):** you can limit the init script to run only for a specific cluster’s creation and restarts by placing it in a folder named after that cluster under `/databricks/init/`.
405 | 3. **Cluster Scoped:** in this mode the init script is not tied to any cluster by its name, and its automatic execution is not a virtue of its DBFS location. Rather, you specify the script in the cluster’s configuration, either by writing it directly in the cluster configuration UI or by storing it on DBFS and specifying the path in the [Cluster Create API](https://docs.azuredatabricks.net/user-guide/clusters/init-scripts.html#cluster-scoped-init-script). Any location under the DBFS `/databricks` folder except `/databricks/init` can be used for this purpose, such as `/databricks/<directory>/set-env-var.sh` (see the sketch below).
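
A minimal sketch of the Cluster Scoped mode, assuming a hypothetical script path and contents, is shown below: the script is stored on DBFS and then referenced explicitly in the cluster configuration rather than relying on a magic folder.

```python
# Store the init script on DBFS (path and contents are illustrative)
dbutils.fs.put(
    "dbfs:/databricks/scripts/set-env-var.sh",
    """#!/bin/bash
echo "MY_SETTING=1" >> /databricks/spark/conf/spark-env.sh
""",
    True,
)

# Fragment of a Clusters API request (or the equivalent UI setting) referencing the script
cluster_fragment = {
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/scripts/set-env-var.sh"}}
    ]
}
```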
406 |
407 | You should treat Init scripts with *extreme* caution because they can easily lead to intractable cluster launch failures. If you really need them, please use the Cluster Scoped execution mode as much as possible because:
408 |
409 | 1. ADB executes the script’s body in each cluster node. Thus, a successful cluster launch and subsequent operation is predicated on all nodal init scripts executing in a timely manner without any errors and reporting a zero exit code. This process is highly error prone, especially for scripts downloading artifacts from an external service over unreliable and/or misconfigured networks.
410 | 2. Because Global and Cluster Named init scripts execute automatically due to their placement in a special DBFS location, it is easy to overlook that they could be causing a cluster to not launch. By specifying the Init script in the Configuration, there’s a higher chance that you’ll consider them while debugging launch failures.
411 |
412 | ## Use Cluster Log Delivery Feature to Manage Logs
413 | *Impact: Medium*
414 |
415 | By default, Cluster logs are sent to the default DBFS, but you should consider sending the logs to a blob store location under your control using the [Cluster Log Delivery](https://docs.azuredatabricks.net/user-guide/clusters/log-delivery.html#cluster-log-delivery) feature (a configuration sketch follows this list). The Cluster Logs contain logs emitted by user code, as well as the Spark framework’s Driver and Executor logs. Sending them to a blob store controlled by yourself is recommended over the default DBFS location because:
416 | 1. ADB’s automatic 30-day default DBFS log purging policy might be too short for certain compliance scenarios. A blob store location in your subscription will be free from such policies.
417 | 2. You can ship logs to other tools only if they are present in your storage account and a resource group governed by you. The root DBFS, although present in your subscription, is launched inside a Microsoft Azure managed resource group and is protected by a read lock. Because of this lock, the logs are only accessible by privileged Azure Databricks framework code. However, constructing a pipeline to ship the logs to downstream log analytics tools requires the logs to be in a lock-free location first.
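
A minimal configuration sketch is shown below; the mount point name is a placeholder for a DBFS mount backed by a storage account you own.

```python
# Fragment of a cluster definition routing cluster logs to your own storage via a DBFS mount
cluster_fragment = {
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/mnt/my-cluster-logs"}
    }
}
```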
418 |
419 | ## Choose Cluster VMs to Match Workload Class
420 | *Impact: High*
421 |
422 | To allocate the right amount and type of cluster resource for a job, we need to understand how different types of jobs demand different types of cluster resources.
423 |
424 | * **Machine Learning** - To train machine learning models, it’s usually required to cache all of the data in memory. Consider using memory optimized VMs so that the cluster can take advantage of the RAM cache. You can also use storage optimized instances for very large datasets. To size the cluster, take a % of the data set → cache it → see how much memory it
425 | used → extrapolate that to the rest of the data.
426 |
427 | * **Streaming** - You need to make sure that the processing rate is just above the input rate at peak times of the day. Depending on peak input rate times, consider compute optimized VMs for the cluster to make sure the processing rate stays higher than your input rate.
428 |
429 | * **ETL** - In this case, data size and how fast the job needs to complete are the leading indicators. Spark doesn’t always require data to be loaded into memory in order to execute transformations, but you’ll at the very least need to see how large the task sizes are on shuffles and compare that to the task throughput you’d like. To analyze the performance of these jobs, start with the basics and check whether the job is bound by CPU, network, or local I/O, and go from there. Consider using a general purpose VM for these jobs.
430 | * **Interactive / Development Workloads** - The ability for a cluster to auto scale is most important for these types of jobs. In this case taking advantage of the [Autoscaling feature](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling) will be your best friend in managing the cost of the infrastructure.
431 |
432 | ## Arrive at Correct Cluster Size by Iterative Performance Testing
433 | *Impact: High*
434 |
435 | It is impossible to predict the correct cluster size without developing the application because Spark and Azure Databricks use numerous techniques to improve cluster utilization. The broad approach you should follow for sizing is:
436 |
437 | 1. Develop on a medium sized cluster of 2-8 nodes, with VMs matched to workload class as explained earlier.
438 | 2. After meeting functional requirements, run end to end test on larger representative data while measuring CPU, memory and I/O used by the cluster at an aggregate level.
439 | 3. Optimize cluster to remove bottlenecks found in step 2
440 | - **CPU bound**: add more cores by adding more nodes
441 |    - **Network bound**: use fewer, bigger SSD-backed machines to reduce the amount of data moved over the network and improve remote read performance
442 | - **Disk I/O bound**: if jobs are spilling to disk, use VMs with more memory.
443 |
444 | Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious bottlenecks have been addressed.
445 |
446 | Performing these steps will help you to arrive at a baseline cluster size which can meet SLA on a subset of data. In theory, Spark jobs, like jobs on other Data Intensive frameworks (Hadoop) exhibit linear scaling. For example, if it takes 5 nodes to meet SLA on a 100TB dataset, and the production data is around 1PB, then prod cluster is likely going to be around 50 nodes in size. You can use this back of the envelope calculation as a first guess to do capacity planning. However, there are scenarios where Spark jobs don’t scale linearly. In some cases this is due to large amounts of shuffle adding an exponential synchronization cost (explained next), but there could be other reasons as well. Hence, to refine the first estimate and arrive at a more accurate node count we recommend repeating this process 3-4 times on increasingly larger data set sizes, say 5%, 10%, 15%, 30%, etc. The overall accuracy of the process depends on how closely the test data matches the live workload both in type and size.
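
As a trivial sketch of the back-of-the-envelope estimate described above (the numbers are the illustrative ones from the text):

```python
baseline_nodes, baseline_tb = 5, 100          # cluster size that met the SLA on the test data
production_tb = 1000                          # ~1 PB of production data
first_guess = baseline_nodes * production_tb / baseline_tb
print(f"First-guess production cluster size: ~{first_guess:.0f} nodes")   # ~50 nodes
```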
447 |
448 | ## Tune Shuffle for Optimal Performance
449 | *Impact: High*
450 |
451 | A shuffle occurs when we need to move data from one node to another in order to complete a stage. Depending on the type of transformation you are doing, you may cause a shuffle to occur. This happens when all the executors need to see all of the data in order to accurately perform the action. If the job requires a wide transformation, such as a group by or distinct, you can expect it to execute more slowly because all of the partitions need to be shuffled around in order to complete the job.
452 |
453 |
454 |
455 |
456 |
457 | *Figure 7: Shuffle vs. no-shuffle*
458 |
459 |
460 | You have two control knobs of a shuffle that you can use to optimize:
461 | * The number of partitions being shuffled:
462 | 
463 | * The number of partitions that you can compute in parallel.
464 | + This is equal to the number of cores in a cluster.
465 |
466 | These two determine the partition size, which we recommend should be in the Megabytes to 1 Gigabyte range. If your shuffle partitions are too small, you may be unnecessarily adding more tasks to the stage. But if they are too big, you may get bottlenecked by the network.
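
For example, a rough sketch of sizing the shuffle partition count (the input size and target partition size below are illustrative values, not recommendations for your workload):

```python
shuffle_input_gb = 200        # estimated size of the data being shuffled
target_partition_mb = 512     # aim for roughly 0.5 GB per shuffle partition
num_partitions = int(shuffle_input_gb * 1024 / target_partition_mb)

# Spark's default is 200 shuffle partitions; adjust it to hit the target partition size
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```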
467 |
468 | ## Partition Your Data
469 | *Impact: High*
470 |
471 | This is a broad Big Data best practice not limited to Azure Databricks, and we mention it here because it can notably impact the performance of Databricks jobs. Storing data in partitions allows you to take advantage of partition pruning and data skipping, two very important features which avoid unnecessary data reads. Most of the time partitions will be
472 | on a date field, but you should choose your partitioning field based on the predicates most often used by your queries. For example, if you’re always going to be filtering based on “Region,” then consider partitioning your data by region (see the write sketch after the list below). Keep the following in mind:
473 | * Evenly distributed data across all partitions (date is the most common)
474 | * 10s of GB per partition (~10 to ~50GB)
475 | * Small data sets should not be partitioned
476 | * Beware of over partitioning
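
A minimal write sketch, assuming a DataFrame `df` with a "Region" column and a hypothetical output path:

```python
# Write the dataset partitioned by the column most often used in query predicates
(df.write
   .partitionBy("Region")                      # or a date column for time-based filtering
   .format("parquet")
   .mode("overwrite")
   .save("/mnt/datalake/sales_by_region"))
```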
477 |
478 |
479 | # Running ADB Applications Smoothly: Guidelines on Observability and Monitoring
480 |
481 | > ***“Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.”***
482 | Jamie Zawinski
483 |
484 | By now we have covered planning for ADB deployments, provisioning Workspaces, selecting clusters, and deploying your applications on them. Now, let's talk about how to monitor your Azure Databricks apps. These apps are rarely executed in isolation and need to be monitored
485 | along with a set of other services. Monitoring falls into four broad areas:
486 |
487 | 1. Resource utilization (CPU/Memory/Network) across an Azure Databricks cluster. This is referred to as VM metrics.
488 | 2. Spark metrics, which enable monitoring of Spark applications to help uncover bottlenecks
489 | 3. Spark application logs, which enable administrators/developers to query the logs, debug issues, and investigate job run failures. This is specifically helpful to also understand exceptions across your workloads.
490 | 4. Application instrumentation, which is native instrumentation that you add to your application for custom troubleshooting
491 |
492 | For the purposes of this version of the document we will focus on (1). This is the most common ask from customers.
493 |
494 | ## Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace
495 | *Impact: Medium*
496 |
497 | An important facet of monitoring is understanding the resource utilization in Azure Databricks clusters. You can also extend this to understanding utilization across all clusters in a workspace. This information is useful in arriving at the correct cluster and VM sizes. Each VM does have a set of limits (cores/disk throughput/network throughput) which play an important role in determining the performance profile of an Azure Databricks job.
498 | In order to get utilization metrics for an Azure Databricks cluster, you can stream the VMs' metrics to an Azure Log Analytics Workspace (see Appendix A) by installing the Log Analytics Agent on each cluster node. Note: This could increase your cluster
499 | startup time by a few minutes.
500 |
501 |
502 | ### Querying VM metrics in Log Analytics once collection has started (see Appendix A)
503 |
504 | You can use Log Analytics directly to query the Perf data. For example, you can write a query that charts CPU utilization for the VMs of a specific cluster ID, as shown in the screenshot below; see the Log Analytics overview for further documentation on Log Analytics and its query syntax. The notebook snippet below stores the agent onboarding script on DBFS so it can be used as a cluster init script (Appendix A walks through the full setup).
505 |
506 | ```python
507 | %python
508 | script = """
509 | sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d
510 | wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w $YOUR_ID -s $YOUR_KEY -d opinsights.azure.com
511 | """
512 |
513 | # Save the onboarding script to DBFS so it can be used as a cluster init script on the VMs; replace $YOUR_ID and $YOUR_KEY above with your Log Analytics workspace ID and key
514 | dbutils.fs.put("/databricks/log_init_scripts/configure-omsagent.sh", script, True)
515 | ```
516 |
517 | 
518 |
519 | You can also use Grafana to visualize your data from Log Analytics.
520 |
521 | ## Cost Management, Chargeback and Analysis
522 |
523 | This section focuses on Azure Databricks billing, the tools to manage and analyze cost, and how to charge costs back to teams.
524 |
525 | ### Azure Databricks Billing
526 | First, it is important to understand the different workloads and tiers available with Azure Databricks. Azure Databricks is available in 2 tiers – Standard and Premium. The Premium tier offers additional features on top of what is available in the Standard tier, including role-based access control for notebooks, jobs, and tables, audit logs, Azure AD credential passthrough, conditional access, and many more. Please refer to [Azure Databricks pricing](https://azure.microsoft.com/en-us/pricing/details/databricks/) for the complete list.
527 |
528 | Both the Premium and Standard tiers come with 3 types of workload:
529 | * Jobs Compute (previously called Data Engineering)
530 | * Jobs Light Compute (previously called Data Engineering Light)
531 | * All-purpose Compute (previously called Data Analytics)
532 | Jobs Compute and Jobs Light Compute make it easy for data engineers to build and execute jobs, while All-purpose Compute makes it easy for data scientists to explore, visualize, manipulate, and share data and insights interactively. Depending on the use case, one can also use All-purpose Compute for data engineering or automated scenarios, especially if the incoming job rate is high.
533 |
534 | When you create an Azure Databricks workspace and spin up a cluster, the following resources are consumed:
535 | * DBUs – A DBU is a unit of processing capability, billed on a per-second usage
536 | * Virtual Machines – These represent your Databricks clusters that run the Databricks Runtime
537 | * Public IP Addresses – These represent the IP Addresses consumed by the Virtual Machines when the cluster is running
538 | * Blob Storage – Each workspace comes with a default storage
539 | * Managed Disk
540 | * Bandwidth – Bandwidth charges for any data transfer
541 |
542 | | Service or Resource | Pricing |
543 | | --- | --- |
544 | | DBUs |[DBU pricing](https://azure.microsoft.com/en-us/pricing/details/databricks/) |
545 | | VMs |[VM pricing](https://azure.microsoft.com/en-us/pricing/details/databricks/) |
546 | | Public IP Addresses |[Public IP Addresses pricing](https://azure.microsoft.com/en-us/pricing/details/ip-addresses/) |
547 | | Blob Storage |[Blob Storage pricing](https://azure.microsoft.com/en-us/pricing/details/storage/) |
548 | | Managed Disk |[Managed Disk pricing](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) |
549 | | Bandwidth |[Bandwidth pricing](https://azure.microsoft.com/en-us/pricing/details/bandwidth/) |
550 |
551 | In addition, if you use other services as part of your end-to-end solution, such as Azure Cosmos DB or Azure Event Hubs, they are charged per their own pricing plans.
552 |
553 | There are 2 pricing plans for Azure Databricks DBUs:
554 |
555 | 1. Pay as you go – Pay for the DBUs as you use them. Refer to the pricing page for the DBU prices based on the SKU. The DBU-per-hour price for different SKUs differs across the Azure public cloud, Azure Gov, and Azure China regions.
556 |
557 | 2. Pre-purchase or Reservations – You can get up to 37% savings over pay-as-you-go DBU prices when you pre-purchase Azure Databricks Units (DBUs) as Databricks Commit Units (DBCUs) for either 1 or 3 years. A Databricks Commit Unit (DBCU) normalizes usage from Azure Databricks workloads and tiers into a single purchase. Your DBU usage across those workloads and tiers will draw down from the Databricks Commit Units (DBCUs) until they are exhausted or the purchase term expires. The draw-down rate will be equivalent to the price of the DBU, as per the table above.
558 |
559 | Since you are also billed for the VMs, you have both of the above options for VMs as well:
560 |
561 | 1. Pay as you go
562 | 2. [Reservations](https://azure.microsoft.com/en-us/pricing/reserved-vm-instances/)
563 |
564 | Please see a few examples of billing for Azure Databricks with the pay-as-you-go option:
565 |
566 | Depending on the type of workload your cluster runs, you will either be charged for Jobs Compute, Jobs Light Compute, or All-purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you will be charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you will be billed for All-purpose Compute workload.
567 |
568 | Accordingly, the pricing will depend on the components below:
569 | 1. DBU SKU – DBU price based on the workload and tier
570 | 2. VM SKU – VM price based on the VM SKU
571 | 3. DBU Count – Each VM SKU has an associated DBU count. Example – D3v2 has a DBU count of 0.75
572 | 4. Region
573 | 5. Duration
574 |
575 | #### Example 1: If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing for the All-purpose Compute workload would be the following:
576 | * VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598
577 | * DBU cost for All-purpose Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100
578 | * The total cost would therefore be $598 (VM Cost) + $1,100 (DBU Cost) = $1,698.
579 |
580 | #### Example 2: If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing for the Jobs Compute workload would be the following:
581 | * VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598
582 | * DBU cost for Jobs Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600
583 | * The total cost would therefore be $598 (VM Cost) + $600 (DBU Cost) = $1,198.
584 |
585 | In addition to VM and DBU charges, there will be additional charges for managed disks, public IP address, bandwidth, or any other resource such as Azure Storage, Azure Cosmos DB depending on your application.
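
The DBU and VM arithmetic in the two examples above can be reproduced with a small helper. This is only a sketch using the illustrative prices quoted above; always check the pricing page for current rates.

```python
def databricks_cost(hours, instances, vm_price_per_hour, dbu_per_node, dbu_price):
    vm_cost = hours * instances * vm_price_per_hour
    dbu_cost = hours * instances * dbu_per_node * dbu_price
    return vm_cost, dbu_cost, vm_cost + dbu_cost

# Example 1: All-purpose Compute, Premium tier, 10 x DS13v2 for 100 hours
print(databricks_cost(100, 10, 0.598, 2, 0.55))   # (598.0, 1100.0, 1698.0)
# Example 2: Jobs Compute on the same cluster
print(databricks_cost(100, 10, 0.598, 2, 0.30))   # (598.0, 600.0, 1198.0)
```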
586 |
587 | #### Azure Databricks Trial
588 | If you are new to Azure Databricks, you can also use a Trial SKU that gives you free DBUs for Premium tier for 14 days. You will still need to pay for other resources like VM, Storage etc. that are consumed during this period. After the trial is over, you will need to start paying for the DBUs.
589 |
590 | ### Chargeback Scenarios
591 |
592 | There are 2 broad scenarios we have seen with respect to charging internal teams back for shared Databricks resources:
593 | 1. Chargeback within a single Azure Databricks workspace: In this case, a single workspace is shared across multiple teams and you would like to charge the individual teams back. Individual teams use their own Databricks clusters and can be charged back at the cluster level.
594 |
595 | 2. Chargeback across multiple Databricks workspaces: In this case, teams use their own workspaces and would like to charge back at the workspace level.
596 | To support these scenarios, Azure Databricks leverages Azure Tags so that users can view the cost/usage for resources carrying those tags. There are default tags that come with the resources.
597 |
598 | Please see below the default tags that are available with the resources:
599 | | Resources | Default Tags |
600 | | --- | --- |
601 | | All-purpose Compute |Vendor, Creator, ClusterName, ClusterId|
602 | | Jobs Compute or Jobs Light Compute |Vendor, Creator, ClusterName, ClusterId, RunName, JobId |
603 | | Pool |Vendor, DatabricksInstancePoolId,DatabricksInstancePoolCreatorId |
604 | | Resources created during workspace creation (Storage, Worker VNet, NSG) |application, databricks-environment |
605 |
606 | In addition to the default tags, customers can add custom tags to the resources based on how they want to charge back. Both default and custom tags are displayed on Azure bills, which allows one to charge back by filtering resource usage based on tags.
607 |
608 | 1. [Cluster Tags](https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-tags): You can create custom tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to underlying cluster resources – VMs, DBUs, Public IP Addresses, Disks.
609 |
610 | 2. [Pool Tags](https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/configure#--pool-tags): You can create custom tags as key-value pairs when you create a pool, and Azure Databricks applies these tags to underlying pool resources – VMs, Public IP Addresses, Disks. Pool-backed clusters inherit default and custom tags from the pool configuration.
611 |
612 | 3. [Workspace Tags](https://docs.microsoft.com/en-us/azure/databricks/administration-guide/account-settings/usage-detail-tags-azure): You can create custom tags as key-value pairs when you create an Azure Databricks workspace. These tags apply to underlying resources within the workspace – VMs, DBUs, and others.
613 |
614 | Please see below how tags propagate to DBUs and VMs:
615 |
616 | 1. Clusters created from pools
617 | * DBU Tag = Workspace Tag + Pool Tag + Cluster Tag
618 | * VM Tag = Workspace Tag + Pool Tag
619 |
620 | 2. Clusters not from Pools
621 | * DBU Tag = Workspace Tag + Cluster Tag
622 | * VM Tag = Workspace Tag + Cluster Tag
623 |
624 | These tags (default and custom) propagate to [Cost Analysis Reports](https://docs.microsoft.com/en-us/azure/cost-management-billing/costs/quick-acm-cost-analysis) that you can access in the Azure Portal. The below section will explain how to do cost/usage analysis using these tags.
625 |
626 | ### Cost/Usage Analysis
627 | The Cost Analysis report is available under Cost Management within the Azure Portal. Please refer to the [Cost Management](https://docs.microsoft.com/en-us/azure/cost-management-billing/costs/quick-acm-cost-analysis) section for a detailed overview of how to use Cost Management.
628 |
629 | 
630 |
631 | The example below gives you a quick start for doing cost analysis of Azure Databricks. The steps are:
632 |
633 | 1. In Azure Portal, click on Cost Management + Billing
634 | 2. In Cost Management, click on Cost Analysis Tab
635 |
636 | 
637 |
638 | 3. Choose the billing scope you want the report for and make sure the user has the Cost Management Reader permission for that scope.
639 | 4. Once selected, you will see cost reports for all the Azure resources at that scope.
640 | 5. After that, you can create different reports by using the different options on the chart. For example, one of the reports you can create uses:
641 |
642 | * Chart option as Column (stacked)
643 | * Granularity – Daily
644 | * Group by – Tag – Choose ClusterName or ClusterId
645 |
646 | You will see something like the chart below, which shows the daily distribution of cost for the different clusters in your subscription or the scope that you chose in Step 3. You also have the option to save this report and share it with your team.
647 |
648 | 
649 |
650 | To charge back, you can filter this report by using the tag option. For example, you can use the default tag Creator, or use your own custom tag, such as Cost Center, and charge back based on that.
651 |
652 | 
653 |
654 | You also have the option to consume this data as a CSV export or through the native Power BI connector for Cost Management. Please see below:
655 |
656 | 1. To download this data as a CSV, you can set up an export from Cost Management + Billing -> Usage + Charges and choose Usage Details Version 2 on the right. Refer to [this](https://docs.microsoft.com/en-us/azure/cost-management-billing/reservations/understand-reserved-instance-usage-ea#download-the-usage-csv-file-with-new-data) for more details. Once downloaded, you can view the cost usage data and filter based on tags to charge back. In the CSV, you can refer to the Meter Name to get the Databricks workload consumed. In addition, this is how the other fields are represented for meters related to Azure Databricks:
657 |
658 | * Quantity = Number of Virtual Machines x Number of hours x DBU count
659 | * Effective Price = DBU price based on the SKU
660 | * Cost = Quantity x Effective Price
661 |
662 | 
663 |
664 | 2. There is a native [Cost Management Connector](https://docs.microsoft.com/en-us/power-bi/connect-data/desktop-connect-azure-cost-management) in Power BI that allows one to create powerful, customized visualizations and cost/usage reports.
665 |
666 | 
667 |
668 | Once you connect, you can easily create various rich reports, like the ones below, by choosing the right fields from the table.
669 |
670 | Tip: To filter on tags, you will need to parse the json in Power BI. To do that, follow these steps:
671 |
672 | 1. Go to "Query Editor"
673 | 2. Select the "Usage Details" table
674 | 3. On the right side, the "Properties" tab shows the applied steps, as below:
675 |
676 | 
677 |
678 | 4. From the menu bar go to "Add column" -> "Add custom column"
679 | 5. Name the column and enter the following expression in the query: `= "{" & [Tags] & "}"`
680 |
681 | 
682 |
683 | 6. This will create a new "tags" column in JSON format.
684 | 7. You can now transform the column by expanding it, and then use the different tags as columns in a report.
685 |
686 | Please see some of the common views that can be created easily using this connector:
687 |
688 | * Cost Report breakdown by Resource Group, Tags, MeterName
689 | 
690 |
691 | * Cost Report breakdown by Cluster, and custom tags
692 | 
693 |
694 | * Cost Report breakdown by Cluster and MeterName in a pie chart
695 | 
696 |
697 | * Cost Report breakdown by resource group and cluster, including quantity
698 | 
699 |
700 | ### Pricing difference for Regions
701 |
702 | Please refer to the [Azure Databricks pricing page](https://azure.microsoft.com/en-us/pricing/details/databricks/) to get the pricing for each DBU SKU and the pricing discounts available with Reservations. There are certain differences to consider:
703 | 1. The DBU prices are different for the Azure public cloud and other regions such as Azure Gov
704 | 2. The pre-purchase plan prices are different for the Azure public cloud and Azure Gov
705 |
706 | ### Known Issues/Limitations
707 |
708 | 1. Tag change propagation at the workspace level takes up to ~1 hour to apply to resources under the managed resource group.
709 | 2. Tag change propagation at the workspace level requires a cluster restart for existing running clusters, or a pool expansion.
710 | 3. Cost Management at the parent resource group won’t show the managed resource group’s resource consumption.
711 | 4. Cost Management role assignments are not possible at the managed resource group level. Today, a user must have a role assignment at the parent resource group level or above (i.e. subscription) to see managed resource group consumption.
712 | 5. For clusters created from a pool, only workspace tags and pool tags are propagated to the VMs.
713 | 6. Tag keys and values can contain only characters from the ISO 8859-1 set.
714 | 7. A custom tag gets prefixed with x_ when it conflicts with a default tag.
715 | 8. A maximum of 50 tags can be assigned to an Azure resource.
716 |
717 | # Appendix A
718 |
719 | ## Installing the Log Analytics agent to capture VM metrics
720 |
721 |
722 |
723 | #### Step 1 - Create a Log Analytics Workspace
724 | Please follow the instructions [here](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#create-a-workspace) to create a Log Analytics workspace
725 |
726 | #### Step 2- Get Log Analytics Workspace Credentials
727 | Get the workspace id and key using instructions [here.](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#obtain-workspace-id-and-key)
728 |
729 | Store these in an Azure Key Vault-backed secret scope.
730 |
731 | #### Step 3 - Configure Data Collection in Log Analytics Workspace
732 | Please follow the instructions [here.](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#collect-event-and-performance-data)
733 |
734 | #### Step 4 - Configure the Init Script
735 | Replace the *LOG_ANALYTICS_WORKSPACE_ID* and *LOG_ANALYTICS_WORKSPACE_KEY* with your own info.
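
For reference, a minimal sketch of this step is shown below, retrieving the workspace credentials from the Key Vault-backed secret scope created in Step 2. The scope and key names are hypothetical.

```python
# Build the Log Analytics agent onboarding script from secrets instead of hard-coding the values
workspace_id  = dbutils.secrets.get(scope="monitoring", key="log-analytics-workspace-id")
workspace_key = dbutils.secrets.get(scope="monitoring", key="log-analytics-workspace-key")

script = f"""
sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w {workspace_id} -s {workspace_key} -d opinsights.azure.com
"""

# Store the script on DBFS so it can be referenced as a cluster-scoped init script
dbutils.fs.put("dbfs:/databricks/log_init_scripts/configure-omsagent.sh", script, True)
```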
736 |
737 | 
738 |
739 | The script can now be used as a global script for all clusters (change the path to /databricks/init in that case) or as a cluster-scoped script for specific clusters. We recommend using cluster-scoped scripts, as explained earlier in this document.
740 |
741 | #### Step 5 - View Collected Data via Azure Portal
742 | See [this](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#view-data-collected) document.
743 |
744 | #### References
745 | * https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer
746 | * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/OMS-Agent-for-Linux.md
747 | * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/Troubleshooting.md
748 |
749 | ## Access patterns with Azure Data Lake Storage Gen2
750 | To understand the various access patterns and approaches to securing data in ADLS see the [following guidance](https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks/blob/master/readme.md).
751 |
752 |
--------------------------------------------------------------------------------