├── .gitignore ├── ADBicon.jpg ├── Cost Management PBI config.png ├── Cost Management PBI config2.jpg ├── Cost Management PBI report 1.png ├── Cost Management PBI report2.PNG ├── Cost Management PBI report3.PNG ├── Cost Management PBI report4.PNG ├── Cost Management Report.png ├── Cost Management Reservation.png ├── Cost Management config.png ├── Cost Management connector.png ├── Cost Management export.png ├── Cost Management filter.png ├── Cost Management.png ├── Figure1.PNG ├── Figure2.PNG ├── Figure3.PNG ├── Figure4.PNG ├── Figure5.PNG ├── Figure6.PNG ├── Figure7.PNG ├── Figure8.png ├── Grafana.PNG ├── LICENSE ├── LICENSE-CODE ├── PerfSnippet.PNG ├── Python Snippet.PNG ├── README.md ├── SECURITY.md ├── SparkSnippet.PNG ├── Table 2.PNG ├── Table1.PNG ├── Table2.PNG ├── Table3.PNG └── toc.md /.gitignore: -------------------------------------------------------------------------------- 1 | ## Ignore Visual Studio temporary files, build results, and 2 | ## files generated by popular Visual Studio add-ons. 3 | ## 4 | ## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore 5 | 6 | # User-specific files 7 | *.suo 8 | *.user 9 | *.userosscache 10 | *.sln.docstates 11 | 12 | # User-specific files (MonoDevelop/Xamarin Studio) 13 | *.userprefs 14 | 15 | # Build results 16 | [Dd]ebug/ 17 | [Dd]ebugPublic/ 18 | [Rr]elease/ 19 | [Rr]eleases/ 20 | x64/ 21 | x86/ 22 | bld/ 23 | [Bb]in/ 24 | [Oo]bj/ 25 | [Ll]og/ 26 | 27 | # Visual Studio 2015/2017 cache/options directory 28 | .vs/ 29 | # Uncomment if you have tasks that create the project's static files in wwwroot 30 | #wwwroot/ 31 | 32 | # Visual Studio 2017 auto generated files 33 | Generated\ Files/ 34 | 35 | # MSTest test Results 36 | [Tt]est[Rr]esult*/ 37 | [Bb]uild[Ll]og.* 38 | 39 | # NUNIT 40 | *.VisualState.xml 41 | TestResult.xml 42 | 43 | # Build Results of an ATL Project 44 | [Dd]ebugPS/ 45 | [Rr]eleasePS/ 46 | dlldata.c 47 | 48 | # Benchmark Results 49 | BenchmarkDotNet.Artifacts/ 50 | 51 | # .NET Core 52 | project.lock.json 53 | project.fragment.lock.json 54 | artifacts/ 55 | **/Properties/launchSettings.json 56 | 57 | # StyleCop 58 | StyleCopReport.xml 59 | 60 | # Files built by Visual Studio 61 | *_i.c 62 | *_p.c 63 | *_i.h 64 | *.ilk 65 | *.meta 66 | *.obj 67 | *.iobj 68 | *.pch 69 | *.pdb 70 | *.ipdb 71 | *.pgc 72 | *.pgd 73 | *.rsp 74 | *.sbr 75 | *.tlb 76 | *.tli 77 | *.tlh 78 | *.tmp 79 | *.tmp_proj 80 | *.log 81 | *.vspscc 82 | *.vssscc 83 | .builds 84 | *.pidb 85 | *.svclog 86 | *.scc 87 | 88 | # Chutzpah Test files 89 | _Chutzpah* 90 | 91 | # Visual C++ cache files 92 | ipch/ 93 | *.aps 94 | *.ncb 95 | *.opendb 96 | *.opensdf 97 | *.sdf 98 | *.cachefile 99 | *.VC.db 100 | *.VC.VC.opendb 101 | 102 | # Visual Studio profiler 103 | *.psess 104 | *.vsp 105 | *.vspx 106 | *.sap 107 | 108 | # Visual Studio Trace Files 109 | *.e2e 110 | 111 | # TFS 2012 Local Workspace 112 | $tf/ 113 | 114 | # Guidance Automation Toolkit 115 | *.gpState 116 | 117 | # ReSharper is a .NET coding add-in 118 | _ReSharper*/ 119 | *.[Rr]e[Ss]harper 120 | *.DotSettings.user 121 | 122 | # JustCode is a .NET coding add-in 123 | .JustCode 124 | 125 | # TeamCity is a build add-in 126 | _TeamCity* 127 | 128 | # DotCover is a Code Coverage Tool 129 | *.dotCover 130 | 131 | # AxoCover is a Code Coverage Tool 132 | .axoCover/* 133 | !.axoCover/settings.json 134 | 135 | # Visual Studio code coverage results 136 | *.coverage 137 | *.coveragexml 138 | 139 | # NCrunch 140 | _NCrunch_* 141 | .*crunch*.local.xml 142 | nCrunchTemp_* 143 | 144 
| # MightyMoose 145 | *.mm.* 146 | AutoTest.Net/ 147 | 148 | # Web workbench (sass) 149 | .sass-cache/ 150 | 151 | # Installshield output folder 152 | [Ee]xpress/ 153 | 154 | # DocProject is a documentation generator add-in 155 | DocProject/buildhelp/ 156 | DocProject/Help/*.HxT 157 | DocProject/Help/*.HxC 158 | DocProject/Help/*.hhc 159 | DocProject/Help/*.hhk 160 | DocProject/Help/*.hhp 161 | DocProject/Help/Html2 162 | DocProject/Help/html 163 | 164 | # Click-Once directory 165 | publish/ 166 | 167 | # Publish Web Output 168 | *.[Pp]ublish.xml 169 | *.azurePubxml 170 | # Note: Comment the next line if you want to checkin your web deploy settings, 171 | # but database connection strings (with potential passwords) will be unencrypted 172 | *.pubxml 173 | *.publishproj 174 | 175 | # Microsoft Azure Web App publish settings. Comment the next line if you want to 176 | # checkin your Azure Web App publish settings, but sensitive information contained 177 | # in these scripts will be unencrypted 178 | PublishScripts/ 179 | 180 | # NuGet Packages 181 | *.nupkg 182 | # The packages folder can be ignored because of Package Restore 183 | **/[Pp]ackages/* 184 | # except build/, which is used as an MSBuild target. 185 | !**/[Pp]ackages/build/ 186 | # Uncomment if necessary however generally it will be regenerated when needed 187 | #!**/[Pp]ackages/repositories.config 188 | # NuGet v3's project.json files produces more ignorable files 189 | *.nuget.props 190 | *.nuget.targets 191 | 192 | # Microsoft Azure Build Output 193 | csx/ 194 | *.build.csdef 195 | 196 | # Microsoft Azure Emulator 197 | ecf/ 198 | rcf/ 199 | 200 | # Windows Store app package directories and files 201 | AppPackages/ 202 | BundleArtifacts/ 203 | Package.StoreAssociation.xml 204 | _pkginfo.txt 205 | *.appx 206 | 207 | # Visual Studio cache files 208 | # files ending in .cache can be ignored 209 | *.[Cc]ache 210 | # but keep track of directories ending in .cache 211 | !*.[Cc]ache/ 212 | 213 | # Others 214 | ClientBin/ 215 | ~$* 216 | *~ 217 | *.dbmdl 218 | *.dbproj.schemaview 219 | *.jfm 220 | *.pfx 221 | *.publishsettings 222 | orleans.codegen.cs 223 | 224 | # Including strong name files can present a security risk 225 | # (https://github.com/github/gitignore/pull/2483#issue-259490424) 226 | #*.snk 227 | 228 | # Since there are multiple workflows, uncomment next line to ignore bower_components 229 | # (https://github.com/github/gitignore/pull/1529#issuecomment-104372622) 230 | #bower_components/ 231 | 232 | # RIA/Silverlight projects 233 | Generated_Code/ 234 | 235 | # Backup & report files from converting an old project file 236 | # to a newer Visual Studio version. Backup files are not needed, 237 | # because we have git ;-) 238 | _UpgradeReport_Files/ 239 | Backup*/ 240 | UpgradeLog*.XML 241 | UpgradeLog*.htm 242 | ServiceFabricBackup/ 243 | *.rptproj.bak 244 | 245 | # SQL Server files 246 | *.mdf 247 | *.ldf 248 | *.ndf 249 | 250 | # Business Intelligence projects 251 | *.rdl.data 252 | *.bim.layout 253 | *.bim_*.settings 254 | *.rptproj.rsuser 255 | 256 | # Microsoft Fakes 257 | FakesAssemblies/ 258 | 259 | # GhostDoc plugin setting file 260 | *.GhostDoc.xml 261 | 262 | # Node.js Tools for Visual Studio 263 | .ntvs_analysis.dat 264 | node_modules/ 265 | 266 | # Visual Studio 6 build log 267 | *.plg 268 | 269 | # Visual Studio 6 workspace options file 270 | *.opt 271 | 272 | # Visual Studio 6 auto-generated workspace file (contains which files were open etc.) 
273 | *.vbw 274 | 275 | # Visual Studio LightSwitch build output 276 | **/*.HTMLClient/GeneratedArtifacts 277 | **/*.DesktopClient/GeneratedArtifacts 278 | **/*.DesktopClient/ModelManifest.xml 279 | **/*.Server/GeneratedArtifacts 280 | **/*.Server/ModelManifest.xml 281 | _Pvt_Extensions 282 | 283 | # Paket dependency manager 284 | .paket/paket.exe 285 | paket-files/ 286 | 287 | # FAKE - F# Make 288 | .fake/ 289 | 290 | # JetBrains Rider 291 | .idea/ 292 | *.sln.iml 293 | 294 | # CodeRush 295 | .cr/ 296 | 297 | # Python Tools for Visual Studio (PTVS) 298 | __pycache__/ 299 | *.pyc 300 | 301 | # Cake - Uncomment if you are using it 302 | # tools/** 303 | # !tools/packages.config 304 | 305 | # Tabs Studio 306 | *.tss 307 | 308 | # Telerik's JustMock configuration file 309 | *.jmconfig 310 | 311 | # BizTalk build output 312 | *.btp.cs 313 | *.btm.cs 314 | *.odx.cs 315 | *.xsd.cs 316 | 317 | # OpenCover UI analysis results 318 | OpenCover/ 319 | 320 | # Azure Stream Analytics local run output 321 | ASALocalRun/ 322 | 323 | # MSBuild Binary and Structured Log 324 | *.binlog 325 | 326 | # NVidia Nsight GPU debugger configuration file 327 | *.nvuser 328 | 329 | # MFractors (Xamarin productivity tool) working folder 330 | .mfractor/ 331 | -------------------------------------------------------------------------------- /ADBicon.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/ADBicon.jpg -------------------------------------------------------------------------------- /Cost Management PBI config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI config.png -------------------------------------------------------------------------------- /Cost Management PBI config2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI config2.jpg -------------------------------------------------------------------------------- /Cost Management PBI report 1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report 1.png -------------------------------------------------------------------------------- /Cost Management PBI report2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report2.PNG -------------------------------------------------------------------------------- /Cost Management PBI report3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management PBI report3.PNG -------------------------------------------------------------------------------- /Cost Management PBI report4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost 
Management PBI report4.PNG -------------------------------------------------------------------------------- /Cost Management Report.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management Report.png -------------------------------------------------------------------------------- /Cost Management Reservation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management Reservation.png -------------------------------------------------------------------------------- /Cost Management config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management config.png -------------------------------------------------------------------------------- /Cost Management connector.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management connector.png -------------------------------------------------------------------------------- /Cost Management export.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management export.png -------------------------------------------------------------------------------- /Cost Management filter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management filter.png -------------------------------------------------------------------------------- /Cost Management.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Cost Management.png -------------------------------------------------------------------------------- /Figure1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure1.PNG -------------------------------------------------------------------------------- /Figure2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure2.PNG -------------------------------------------------------------------------------- /Figure3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure3.PNG -------------------------------------------------------------------------------- /Figure4.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure4.PNG 
-------------------------------------------------------------------------------- /Figure5.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure5.PNG -------------------------------------------------------------------------------- /Figure6.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure6.PNG -------------------------------------------------------------------------------- /Figure7.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure7.PNG -------------------------------------------------------------------------------- /Figure8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Figure8.png -------------------------------------------------------------------------------- /Grafana.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Grafana.PNG -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. 
This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution 4.0 International Public License 58 | 59 | By exercising the Licensed Rights (defined below), You accept and agree 60 | to be bound by the terms and conditions of this Creative Commons 61 | Attribution 4.0 International Public License ("Public License"). To the 62 | extent this Public License may be interpreted as a contract, You are 63 | granted the Licensed Rights in consideration of Your acceptance of 64 | these terms and conditions, and the Licensor grants You such rights in 65 | consideration of benefits the Licensor receives from making the 66 | Licensed Material available under these terms and conditions. 67 | 68 | 69 | Section 1 -- Definitions. 70 | 71 | a. Adapted Material means material subject to Copyright and Similar 72 | Rights that is derived from or based upon the Licensed Material 73 | and in which the Licensed Material is translated, altered, 74 | arranged, transformed, or otherwise modified in a manner requiring 75 | permission under the Copyright and Similar Rights held by the 76 | Licensor. For purposes of this Public License, where the Licensed 77 | Material is a musical work, performance, or sound recording, 78 | Adapted Material is always produced where the Licensed Material is 79 | synched in timed relation with a moving image. 80 | 81 | b. Adapter's License means the license You apply to Your Copyright 82 | and Similar Rights in Your contributions to Adapted Material in 83 | accordance with the terms and conditions of this Public License. 84 | 85 | c. Copyright and Similar Rights means copyright and/or similar rights 86 | closely related to copyright including, without limitation, 87 | performance, broadcast, sound recording, and Sui Generis Database 88 | Rights, without regard to how the rights are labeled or 89 | categorized. For purposes of this Public License, the rights 90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 91 | Rights. 92 | 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. 
Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. Share means to provide material to the public by any means or 116 | process that requires permission under the Licensed Rights, such 117 | as reproduction, public display, public performance, distribution, 118 | dissemination, communication, or importation, and to make material 119 | available to the public including in ways that members of the 120 | public may access the material from a place and at a time 121 | individually chosen by them. 122 | 123 | j. Sui Generis Database Rights means rights other than copyright 124 | resulting from Directive 96/9/EC of the European Parliament and of 125 | the Council of 11 March 1996 on the legal protection of databases, 126 | as amended and/or succeeded, as well as other essentially 127 | equivalent rights anywhere in the world. 128 | 129 | k. You means the individual or entity exercising the Licensed Rights 130 | under this Public License. Your has a corresponding meaning. 131 | 132 | 133 | Section 2 -- Scope. 134 | 135 | a. License grant. 136 | 137 | 1. Subject to the terms and conditions of this Public License, 138 | the Licensor hereby grants You a worldwide, royalty-free, 139 | non-sublicensable, non-exclusive, irrevocable license to 140 | exercise the Licensed Rights in the Licensed Material to: 141 | 142 | a. reproduce and Share the Licensed Material, in whole or 143 | in part; and 144 | 145 | b. produce, reproduce, and Share Adapted Material. 146 | 147 | 2. Exceptions and Limitations. For the avoidance of doubt, where 148 | Exceptions and Limitations apply to Your use, this Public 149 | License does not apply, and You do not need to comply with 150 | its terms and conditions. 151 | 152 | 3. Term. The term of this Public License is specified in Section 153 | 6(a). 154 | 155 | 4. Media and formats; technical modifications allowed. The 156 | Licensor authorizes You to exercise the Licensed Rights in 157 | all media and formats whether now known or hereafter created, 158 | and to make technical modifications necessary to do so. The 159 | Licensor waives and/or agrees not to assert any right or 160 | authority to forbid You from making technical modifications 161 | necessary to exercise the Licensed Rights, including 162 | technical modifications necessary to circumvent Effective 163 | Technological Measures. For purposes of this Public License, 164 | simply making modifications authorized by this Section 2(a) 165 | (4) never produces Adapted Material. 166 | 167 | 5. Downstream recipients. 168 | 169 | a. Offer from the Licensor -- Licensed Material. Every 170 | recipient of the Licensed Material automatically 171 | receives an offer from the Licensor to exercise the 172 | Licensed Rights under the terms and conditions of this 173 | Public License. 174 | 175 | b. No downstream restrictions. 
You may not offer or impose 176 | any additional or different terms or conditions on, or 177 | apply any Effective Technological Measures to, the 178 | Licensed Material if doing so restricts exercise of the 179 | Licensed Rights by any recipient of the Licensed 180 | Material. 181 | 182 | 6. No endorsement. Nothing in this Public License constitutes or 183 | may be construed as permission to assert or imply that You 184 | are, or that Your use of the Licensed Material is, connected 185 | with, or sponsored, endorsed, or granted official status by, 186 | the Licensor or others designated to receive attribution as 187 | provided in Section 3(a)(1)(A)(i). 188 | 189 | b. Other rights. 190 | 191 | 1. Moral rights, such as the right of integrity, are not 192 | licensed under this Public License, nor are publicity, 193 | privacy, and/or other similar personality rights; however, to 194 | the extent possible, the Licensor waives and/or agrees not to 195 | assert any such rights held by the Licensor to the limited 196 | extent necessary to allow You to exercise the Licensed 197 | Rights, but not otherwise. 198 | 199 | 2. Patent and trademark rights are not licensed under this 200 | Public License. 201 | 202 | 3. To the extent possible, the Licensor waives any right to 203 | collect royalties from You for the exercise of the Licensed 204 | Rights, whether directly or through a collecting society 205 | under any voluntary or waivable statutory or compulsory 206 | licensing scheme. In all other cases the Licensor expressly 207 | reserves any right to collect such royalties. 208 | 209 | 210 | Section 3 -- License Conditions. 211 | 212 | Your exercise of the Licensed Rights is expressly made subject to the 213 | following conditions. 214 | 215 | a. Attribution. 216 | 217 | 1. If You Share the Licensed Material (including in modified 218 | form), You must: 219 | 220 | a. retain the following if it is supplied by the Licensor 221 | with the Licensed Material: 222 | 223 | i. identification of the creator(s) of the Licensed 224 | Material and any others designated to receive 225 | attribution, in any reasonable manner requested by 226 | the Licensor (including by pseudonym if 227 | designated); 228 | 229 | ii. a copyright notice; 230 | 231 | iii. a notice that refers to this Public License; 232 | 233 | iv. a notice that refers to the disclaimer of 234 | warranties; 235 | 236 | v. a URI or hyperlink to the Licensed Material to the 237 | extent reasonably practicable; 238 | 239 | b. indicate if You modified the Licensed Material and 240 | retain an indication of any previous modifications; and 241 | 242 | c. indicate the Licensed Material is licensed under this 243 | Public License, and include the text of, or the URI or 244 | hyperlink to, this Public License. 245 | 246 | 2. You may satisfy the conditions in Section 3(a)(1) in any 247 | reasonable manner based on the medium, means, and context in 248 | which You Share the Licensed Material. For example, it may be 249 | reasonable to satisfy the conditions by providing a URI or 250 | hyperlink to a resource that includes the required 251 | information. 252 | 253 | 3. If requested by the Licensor, You must remove any of the 254 | information required by Section 3(a)(1)(A) to the extent 255 | reasonably practicable. 256 | 257 | 4. If You Share Adapted Material You produce, the Adapter's 258 | License You apply must not prevent recipients of the Adapted 259 | Material from complying with this Public License. 
260 | 261 | 262 | Section 4 -- Sui Generis Database Rights. 263 | 264 | Where the Licensed Rights include Sui Generis Database Rights that 265 | apply to Your use of the Licensed Material: 266 | 267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 268 | to extract, reuse, reproduce, and Share all or a substantial 269 | portion of the contents of the database; 270 | 271 | b. if You include all or a substantial portion of the database 272 | contents in a database in which You have Sui Generis Database 273 | Rights, then the database in which You have Sui Generis Database 274 | Rights (but not its individual contents) is Adapted Material; and 275 | 276 | c. You must comply with the conditions in Section 3(a) if You Share 277 | all or a substantial portion of the contents of the database. 278 | 279 | For the avoidance of doubt, this Section 4 supplements and does not 280 | replace Your obligations under this Public License where the Licensed 281 | Rights include other Copyright and Similar Rights. 282 | 283 | 284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 285 | 286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 296 | 297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 304 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 306 | 307 | c. The disclaimer of warranties and limitation of liability provided 308 | above shall be interpreted in a manner that, to the extent 309 | possible, most closely approximates an absolute disclaimer and 310 | waiver of all liability. 311 | 312 | 313 | Section 6 -- Term and Termination. 314 | 315 | a. This Public License applies for the term of the Copyright and 316 | Similar Rights licensed here. However, if You fail to comply with 317 | this Public License, then Your rights under this Public License 318 | terminate automatically. 319 | 320 | b. Where Your right to use the Licensed Material has terminated under 321 | Section 6(a), it reinstates: 322 | 323 | 1. automatically as of the date the violation is cured, provided 324 | it is cured within 30 days of Your discovery of the 325 | violation; or 326 | 327 | 2. upon express reinstatement by the Licensor. 328 | 329 | For the avoidance of doubt, this Section 6(b) does not affect any 330 | right the Licensor may have to seek remedies for Your violations 331 | of this Public License. 332 | 333 | c. 
For the avoidance of doubt, the Licensor may also offer the 334 | Licensed Material under separate terms or conditions or stop 335 | distributing the Licensed Material at any time; however, doing so 336 | will not terminate this Public License. 337 | 338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 339 | License. 340 | 341 | 342 | Section 7 -- Other Terms and Conditions. 343 | 344 | a. The Licensor shall not be bound by any additional or different 345 | terms or conditions communicated by You unless expressly agreed. 346 | 347 | b. Any arrangements, understandings, or agreements regarding the 348 | Licensed Material not stated herein are separate from and 349 | independent of the terms and conditions of this Public License. 350 | 351 | 352 | Section 8 -- Interpretation. 353 | 354 | a. For the avoidance of doubt, this Public License does not, and 355 | shall not be interpreted to, reduce, limit, restrict, or impose 356 | conditions on any use of the Licensed Material that could lawfully 357 | be made without permission under this Public License. 358 | 359 | b. To the extent possible, if any provision of this Public License is 360 | deemed unenforceable, it shall be automatically reformed to the 361 | minimum extent necessary to make it enforceable. If the provision 362 | cannot be reformed, it shall be severed from this Public License 363 | without affecting the enforceability of the remaining terms and 364 | conditions. 365 | 366 | c. No term or condition of this Public License will be waived and no 367 | failure to comply consented to unless expressly agreed to by the 368 | Licensor. 369 | 370 | d. Nothing in this Public License constitutes or may be interpreted 371 | as a limitation upon, or waiver of, any privileges and immunities 372 | that apply to the Licensor or You, including from the legal 373 | processes of any jurisdiction or authority. 374 | 375 | 376 | ======================================================================= 377 | 378 | Creative Commons is not a party to its public 379 | licenses. Notwithstanding, Creative Commons may elect to apply one of 380 | its public licenses to material it publishes and in those instances 381 | will be considered the “Licensor.” The text of the Creative Commons 382 | public licenses is dedicated to the public domain under the CC0 Public 383 | Domain Dedication. Except for the limited purpose of indicating that 384 | material is shared under a Creative Commons public license or as 385 | otherwise permitted by the Creative Commons policies published at 386 | creativecommons.org/policies, Creative Commons does not authorize the 387 | use of the trademark "Creative Commons" or any other trademark or logo 388 | of Creative Commons without its prior written consent including, 389 | without limitation, in connection with any unauthorized modifications 390 | to any of its public licenses or any other arrangements, 391 | understandings, or agreements concerning use of licensed material. For 392 | the avoidance of doubt, this paragraph does not form part of the 393 | public licenses. 394 | 395 | Creative Commons may be contacted at creativecommons.org. 
-------------------------------------------------------------------------------- /LICENSE-CODE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | Copyright (c) Microsoft Corporation 3 | 4 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and 5 | associated documentation files (the "Software"), to deal in the Software without restriction, 6 | including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, 7 | and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, 8 | subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all copies or substantial 11 | portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT 14 | NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 15 | IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 16 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE 17 | SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /PerfSnippet.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/PerfSnippet.PNG -------------------------------------------------------------------------------- /Python Snippet.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Python Snippet.PNG -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Azure Databricks Best Practices 2 | 3 | ![Azure Databricks](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/ADBicon.jpg "Azure Databricks") 4 | 5 | 6 | Authors: 7 | Dhruv Kumar, Senior Solutions Architect, Databricks 8 | Premal Shah, Azure Databricks PM, Microsoft 9 | Bhanu Prakash, Azure Databricks PM, Microsoft 10 | 11 | Written by: Priya Aswani, WW Data Engineering & AI Technical Lead 12 | 13 | 14 | Published: June 22, 2019 15 | 16 | 17 | Version: 1.0 18 | 19 | 20 | [Click here for the Best Practices](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/toc.md) 21 | 22 | Disclaimers: 23 | 24 | This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. 25 | 26 | Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred. 27 | 28 | This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes. 29 | 30 | © 2019 Microsoft. All rights reserved. 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | # Contributing 43 | 44 | This project welcomes contributions and suggestions. 
Most contributions require you to agree to a 45 | Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us 46 | the rights to use your contribution. For details, visit https://cla.microsoft.com. 47 | 48 | When you submit a pull request, a CLA-bot will automatically determine whether you need to provide 49 | a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions 50 | provided by the bot. You will only need to do this once across all repos using our CLA. 51 | 52 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 53 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or 54 | contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. 55 | 56 | # Legal Notices 57 | 58 | Microsoft and any contributors grant you a license to the Microsoft documentation and other content 59 | in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode), 60 | see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the 61 | [LICENSE-CODE](LICENSE-CODE) file. 62 | 63 | Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation 64 | may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. 65 | The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. 66 | Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653. 67 | 68 | Privacy information can be found at https://privacy.microsoft.com/en-us/ 69 | 70 | Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, 71 | or trademarks, whether by implication, estoppel or otherwise. 72 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ## Security 4 | 5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/). 6 | 7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/opensource/security/definition), please report it to us as described below. 8 | 9 | ## Reporting Security Issues 10 | 11 | **Please do not report security vulnerabilities through public GitHub issues.** 12 | 13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/opensource/security/create-report). 14 | 15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). 
If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/opensource/security/pgpkey). 16 | 17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://aka.ms/opensource/security/msrc). 18 | 19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue: 20 | 21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.) 22 | * Full paths of source file(s) related to the manifestation of the issue 23 | * The location of the affected source code (tag/branch/commit or direct URL) 24 | * Any special configuration required to reproduce the issue 25 | * Step-by-step instructions to reproduce the issue 26 | * Proof-of-concept or exploit code (if possible) 27 | * Impact of the issue, including how an attacker might exploit the issue 28 | 29 | This information will help us triage your report more quickly. 30 | 31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/opensource/security/bounty) page for more details about our active programs. 32 | 33 | ## Preferred Languages 34 | 35 | We prefer all communications to be in English. 36 | 37 | ## Policy 38 | 39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/opensource/security/cvd). 40 | 41 | 42 | -------------------------------------------------------------------------------- /SparkSnippet.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/SparkSnippet.PNG -------------------------------------------------------------------------------- /Table 2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table 2.PNG -------------------------------------------------------------------------------- /Table1.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table1.PNG -------------------------------------------------------------------------------- /Table2.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table2.PNG -------------------------------------------------------------------------------- /Table3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Azure/AzureDatabricksBestPractices/7df384b6bf8752a43bcb1d476712da8425fc67e2/Table3.PNG -------------------------------------------------------------------------------- /toc.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 |


8 | 9 | # Azure Databricks Best Practices 10 | 11 | Authors: 12 | * Dhruv Kumar, Senior Solutions Architect, Databricks 13 | * Premal Shah, Azure Databricks PM, Microsoft 14 | * Bhanu Prakash, Azure Databricks PM, Microsoft 15 | 16 | Written by: Priya Aswani, WW Data Engineering & AI Technical Lead 17 | 18 | # Table of Contents 19 | 20 | - [Introduction](#Introduction) 21 | - [Scalable ADB Deployments: Guidelines for Networking, Security, and Capacity Planning](#scalable-ADB-Deployments-Guidelines-for-Networking-Security-and-Capacity-Planning) 22 | * [Azure Databricks 101](#Azure-Databricks-101) 23 | * [Map Workspaces to Business Units](#Map-Workspaces-to-Business-Divisions) 24 | * [Deploy Workspaces in Multiple Subscriptions to Honor Azure Capacity Limits](#Deploy-Workspaces-in-Multiple-Subscriptions-to-Honor-Azure-Capacity-Limits) 25 | + [ADB Workspace Limits](#ADB-Workspace-Limits) 26 | + [Azure Subscription Limits](#Azure-Subscription-Limits) 27 | * [Consider Isolating Each Workspace in its own VNet](#Consider-Isolating-Each-Workspace-in-its-own-VNet) 28 | * [Select the Largest Vnet CIDR](#Select-the-Largest-Vnet-CIDR) 29 | * [Azure Databricks Deployment with limited private IP addresses](#Azure-Databricks-Deployment-with-limited-private-IP-addresses) 30 | * [Do not Store any Production Data in Default DBFS Folders](#Do-not-Store-any-Production-Data-in-Default-DBFS-Folders) 31 | * [Always Hide Secrets in a Key Vault](#Always-Hide-Secrets-in-a-Key-Vault) 32 | - [Deploying Applications on ADB: Guidelines for Selecting, Sizing, and Optimizing Clusters Performance](#Deploying-Applications-on-ADB-Guidelines-for-Selecting-Sizing-and-Optimizing-Clusters-Performance) 33 | * [Support Interactive analytics using Shared High Concurrency Clusters](#support-interactive-analytics-using-shared-high-concurrency-clusters) 34 | * [Support Batch ETL Workloads with Single User Ephemeral Standard Clusters](#support-batch-etl-workloads-with-single-user-ephemeral-standard-clusters) 35 | * [Favor Cluster Scoped Init scripts over Global and Named scripts](#favor-cluster-scoped-init-scripts-over-global-and-named-scripts) 36 | * [Use Cluster Log Delivery Feature to Manage Logs](#Use-Cluster-Log-Delivery-Feature-to-Manage-Logs) 37 | * [Choose VMs to Match Workload](#Choose-VMs-to-Match-Workload) 38 | * [Arrive at Correct Cluster Size by Iterative Performance Testing](#Arrive-at-correct-cluster-size-by-iterative-performance-testing) 39 | * [Tune Shuffle for Optimal Performance](#Tune-shuffle-for-optimal-performance) 40 | * [Partition Your Data](#partition-your-data) 41 | - [Running ADB Applications Smoothly: Guidelines on Observability and Monitoring](#Running-ADB-Applications-Smoothly-Guidelines-on-Observability-and-Monitoring) 42 | * [Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace](#Collect-resource-utilization-metrics-across-Azure-Databricks-cluster-in-a-Log-Analytics-workspace) 43 | + [Querying VM metrics in Log Analytics once you have started the collection using the above document](#Querying-VM-metrics-in-log-analytics-once-you-have-started-the-collection-using-the-above-document) 44 | - [Cost Management, Chargeback and Analysis](#Cost-Management-Chargeback-and-Analysis) 45 | - [Appendix A](#Appendix-A) 46 | * [Installation for being able to capture VM metrics in Log Analytics](#Installation-for-being-able-to-capture-VM-metrics-in-Log-Analytics) 47 | + [Overview](#Overview) 48 | + [Step 1 - Create a Log Analytics 
Workspace](#step-1---create-a-log-analytics-workspace)
   + [Step 2 - Get Log Analytics Workspace Credentials](#step-2--get-log-analytics-workspace-credentials)
   + [Step 3 - Configure Data Collection in Log Analytics Workspace](#step-3---configure-data-collection-in-log-analytics-workspace)
   + [Step 4 - Configure the Init Script](#Step-4---Configure-the-Init-script)
   + [Step 5 - View Collected Data via Azure Portal](#Step-5---View-Collected-Data-via-Azure-Portal)
   + [References](#References)
 * [Access patterns with Azure Data Lake Storage Gen2](#Access-patterns-with-Azure-Data-Lake-Storage-Gen2)

> ***"A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away."***
Antoine de Saint-Exupéry

## Introduction

Planning, deploying, and running Azure Databricks (ADB) at scale requires many architectural decisions.

While each ADB deployment is unique to an organization's needs, we have found that some patterns are common across most successful ADB projects. Unsurprisingly, these patterns are also in line with modern cloud-centric development best practices.

This short guide summarizes these patterns into prescriptive and actionable best practices for Azure Databricks. We follow a logical path of planning the infrastructure, provisioning the workspaces, developing Azure Databricks applications, and finally, running Azure Databricks in production.

The audience of this guide is system architects, field engineers, and development teams of customers, Microsoft, and Databricks. Since the Azure Databricks product goes through fast iteration cycles, we have avoided recommendations based on roadmap or Private Preview features.

Our recommendations should apply to a typical Fortune 500 enterprise with at least an intermediate level of Azure and Databricks knowledge. We have also classified each recommendation according to its likely impact on the solution's quality attributes. Using the **Impact** factor, you can weigh each recommendation against other competing choices. For example, if the impact is classified as "Very High", not adopting the best practice can have a significant impact on your deployment.

**Important Note**: This guide is intended to be used with the detailed [Azure Databricks Documentation](https://docs.azuredatabricks.net/index.html).

## Scalable ADB Deployments: Guidelines for Networking, Security, and Capacity Planning

Azure Databricks (ADB) deployments for very small organizations, PoC applications, or personal education hardly require any planning. You can spin up a Workspace using the Azure Portal in a matter of minutes, create a Notebook, and start writing code.

Enterprise-grade, large-scale deployments are a different story altogether. Some upfront planning is necessary to manage Azure Databricks deployments across large teams. In particular, you need to understand:

* Networking requirements of Databricks
* The number and the type of Azure networking resources required to launch clusters
* The relationship between Azure and Databricks jargon: Subscription, VNet, Workspace, Cluster, Subnet, etc.
* The overall capacity planning process: where to begin, what to consider?

Let's start with a short Azure Databricks 101 and then discuss some best practices for scalable and secure deployments.

## Azure Databricks 101

ADB is a Big Data analytics service. Being a cloud-optimized, managed [PaaS](https://azure.microsoft.com/en-us/overview/what-is-paas/) offering, it is designed to hide the underlying distributed systems and networking complexity as much as possible from the end user. It is backed by a team of support staff who monitor its health, debug tickets filed via Azure, etc. This allows ADB users to focus on developing value-generating apps rather than worrying about infrastructure management.

You can deploy ADB using the Azure Portal or [ARM templates](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-overview#template-deployment). One successful ADB deployment produces exactly one Workspace, a space where users can log in and author analytics apps. It comprises the file browser, notebooks, tables, clusters, [DBFS](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html#dbfs) storage, etc. More importantly, the Workspace is the fundamental isolation unit in Databricks: all workspaces are completely isolated from each other.

Each workspace is identified by a globally unique 53-bit number, called the ***Workspace ID or Organization ID***. The URL that a customer sees after logging in always uniquely identifies the workspace they are using:

*https://adb-workspaceId.azuredatabricks.net/?o=workspaceId*

Example: *https://adb-12345.eastus2.azuredatabricks.net/?o=12345*

Azure Databricks uses [Azure Active Directory (AAD)](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-whatis) as its exclusive identity provider, and there's a seamless out-of-the-box integration between them. This makes ADB tightly integrated with Azure, just like its other core services. Any AAD member assigned to the Owner or Contributor role can deploy Databricks and is automatically added to the ADB members list upon first login. If a user is not a member or guest of the Active Directory tenant, they can't log in to the workspace.
Granting access to a user in another tenant (for example, if contoso.com wants to collaborate with adventure-works.com users) does work because those external users are added as guests to the tenant hosting Azure Databricks.

Azure Databricks comes with its own user management interface. You can create users and groups in a workspace, assign them certain privileges, etc. While users in AAD are equivalent to Databricks users, by default AAD roles have no relationship with groups created inside ADB, unless you use [SCIM](https://docs.azuredatabricks.net/administration-guide/admin-settings/scim/aad.html) for provisioning users and groups. With SCIM, you can import both groups and users from AAD into Azure Databricks, and the synchronization is automatic after the initial import. ADB also has a special group called ***Admins***, not to be confused with AAD's Admin role.

The first user to log in and initialize the workspace is the workspace ***owner***, and they are automatically assigned to the Databricks admin group. This person can invite other users to the workspace, add them as admins, create groups, etc. The logged-in user's identity is provided by AAD and shows up under the user menu in the Workspace:


*Figure 1: Databricks user menu*

Multiple clusters can exist within a workspace: there is a one-to-many mapping from a Subscription to Workspaces, and, further, from a Workspace to Clusters.

![Figure 2: Azure Databricks Isolation Domains Workspace](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Figure3.PNG "Figure 2: Azure Databricks Isolation Domains Workspace")

*Figure 2: Relationship Between AAD, Workspace, Resource Groups, and Clusters*

With this basic understanding, let's discuss how to plan a typical ADB deployment. We first grapple with the issue of how to divide workspaces and assign them to users and teams.

## Map Workspaces to Business Divisions
*Impact: Very High*

How many workspaces do you need to deploy? The answer depends largely on your organization's structure. We recommend assigning a workspace to a related group of people working together collaboratively. This also helps streamline your access control matrix within the workspace (folders, notebooks, etc.) and across all the resources that the workspace interacts with (storage, and related data stores such as Azure SQL DB and Azure SQL DW). This type of division scheme is also known as the [Business Unit Subscription](https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/decision-guides/subscriptions/) design pattern, and it aligns well with the Databricks chargeback model.

144 | 145 |

146 | 147 | *Figure 3: Business Unit Subscription Design Pattern* 148 | 149 | ## Deploy Workspaces in Multiple Subscriptions to Honor Azure Capacity Limits 150 | *Impact: Very High* 151 | 152 | Customers commonly partition workspaces based on teams or departments and arrive at that division naturally. But it is also important to partition keeping Azure Subscription and ADB Workspace limits in mind. 153 | 154 | ### Databricks Workspace Limits 155 | Azure Databricks is a multitenant service and to provide fair resource sharing to all regional customers, it imposes limits on API calls. These limits are expressed at the Workspace level and are due to internal ADB components. For instance, you can only run up to 1000 concurrent jobs in a workspace. Beyond that, ADB will deny your job submissions. There are also other limits such as max hourly job submissions, max notebooks, etc. 156 | 157 | Key workspace limits are: 158 | 159 | * The maximum number of jobs that a workspace can create in an hour is **5000** 160 | * At any time, you cannot have more than **1000 jobs** simultaneously running in a workspace 161 | * There can be a maximum of **145 notebooks** attached to a cluster 162 | 163 | ### Azure Subscription Limits 164 | Next, there are [Azure limits](https://docs.microsoft.com/en-us/azure/azure-subscription-service-limits) to consider since ADB deployments are built on top of the Azure infrastructure. 165 | 166 | For more help in understanding the impact of these limits or options of increasing them, please contact Microsoft or Databricks technical architects. 167 | 168 | > ***We highly recommend separating workspaces into production and dev, and deploying them into separate subscriptions.*** 169 | 170 | ## Consider Isolating Each Workspace in its own VNet 171 | *Impact: Low* 172 | 173 | While you can deploy more than one Workspace in a VNet by keeping the associated subnet pairs separate from other workspaces, we recommend that you should only deploy one workspace in any Vnet. Doing this perfectly aligns with the ADB's Workspace level isolation model. Most often organizations consider putting multiple workspaces in the same Vnet so that they all can share some common networking resource, like DNS, also placed in the same Vnet because the private address space in a vnet is shared by all resources. You can easily achieve the same while keeping the Workspaces separate by following the [hub and spoke model](https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/hybrid-networking/hub-spoke) and using Vnet Peering to extend the private IP space of the workspace Vnet. Here are the steps: 174 | 1. Deploy each Workspace in its own spoke VNet. 175 | 2. Put all the common networking resources in a central hub Vnet, such as your custom DNS server. 176 | 3. Join the Workspace spokes with the central networking hub using [Vnet Peering](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-peering.html) 177 | 178 | More information: [Azure Virtual Datacenter: a network perspective](https://docs.microsoft.com/en-us/azure/architecture/vdc/networking-virtual-datacenter#topology) 179 | 180 |

181 | 182 |

183 | 184 | *Figure 4: Hub and Spoke Model* 185 | 186 | ## Select the Largest Vnet CIDR 187 | *Impact: Very High* 188 | 189 | > ***This recommendation only applies if you're using the Bring Your Own Vnet feature.*** 190 | 191 | Recall the each Workspace can have multiple clusters. The total capacity of clusters in each workspace is a function of the masks used for the workspace's enclosing Vnet and the pair of subnets associated with each cluster in the workspace. The masks can be changed if you use the [Bring Your Own Vnet](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/vnet-inject.html#vnet-inject) feature as it gives you more control over the networking layout. It is important to understand this relationship for accurate capacity planning. 192 | 193 | * Each cluster node requires 1 Public IP and 2 Private IPs 194 | * These IPs are logically grouped into 2 subnets named “public” and “private” 195 | * For a desired cluster size of X: number of Public IPs = X, number of Private IPs = 2X 196 | * The size of private and public subnets thus determines total number of VMs available for clusters 197 | + /22 mask is larger than /23, so setting private and public to /22 will have more VMs available for creating clusters, than say /23 or below 198 | * But, because of the address space allocation scheme, the size of private and public subnets is constrained by the VNet’s CIDR 199 | * The allowed values for the enclosing VNet CIDR are from /16 through /24 200 | * The private and public subnet masks must be: 201 | + Equal 202 | + Must be greater than /26 203 | 204 | With this info, we can quickly arrive at the table below, showing how many nodes one can use across all clusters for a given VNet CIDR. It is clear that selection of VNet CIDR has far reaching implications in terms of maximum cluster size. 205 | 206 | | Enclosing Vnet CIDR’s Mask where ADB Workspace is deployed | Allowed Masks on Private and Public Subnets (should be equal) | Max number of nodes across all clusters in the Workspace, assuming higher subnet mask is chosen | 207 | |:------------------------------------------------------------:|:---------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------:| 208 | | /16 | /17 through /26 | 32766 | 209 | | /17 | /18 through /26 | 16382 | 210 | | /18 | /19 through /26 | 8190 | 211 | | /19 | /20 through /26 | 4094 | 212 | | /20 | /21 through /26 | 2046 | 213 | | /21 | /22 through /26 | 1022 | 214 | | /22 | /23 through /26 | 510 | 215 | | /23 | /24 through /26 | 254 | 216 | | /24 | /25 or /26 | 126 | 217 | 218 | 219 | 220 | # Azure Databricks Deployment with limited private IP addresses 221 | *Impact: High* 222 | 223 | Depending where data sources are located, Azure Databricks can be deployed in a connected or disconnected scenario. In a connected scenario, Azure Databricks must be able to reach directly data sources located in Azure VNets or on-premises locations. In a disconnected scenario, data can be copied to a storage platform (such as an Azure Data Lake Storage account), to which Azure Databricks can be connected to using mount points. 
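In the disconnected scenario, mounting the storage account is typically done once per workspace. The sketch below mounts an ADLS Gen2 container using a service principal whose credentials are read from a secret scope; the storage account, container, secret scope, and key names are placeholders to adapt to your environment.

```python
# Minimal sketch: mount an ADLS Gen2 container via OAuth with a service principal.
# All names below (secret scope, keys, storage account, container, tenant) are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("adls-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("adls-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://raw@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)
```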
224 | ***This section covers a scenario for deploying Azure Databricks when private IP addresses are limited and Azure Databricks can be configured to access data using mount points (disconnected scenario).*** 225 | 226 | Many multinational enterprise organizations build their Azure platforms on the hub and spoke network architecture, which maps well to the recommended Azure Databricks deployment model of one workspace per VNet. Workspaces are deployed on the spokes, while shared networking and security resources such as ExpressRoute connectivity or DNS infrastructure are deployed in the hub. 227 | Customers who have exhausted (or are close to exhausting) their RFC 1918 IP address ranges have to optimize the address space of spoke VNets. In most cases they can only provide small VNets (/25 or smaller), and only in exceptional cases a larger VNet (such as a /24). 228 | 229 | Because the smallest Azure Databricks deployment requires a /24 VNet, such customers need an alternative solution that lets the business deploy one or more Azure Databricks clusters across multiple VNets as required, while still being able to create larger clusters, which would require larger VNet address space. 230 | 231 | A recommended Azure Databricks implementation that keeps RFC 1918 address consumption to a minimum, while still allowing business users to deploy as many Azure Databricks clusters as they want, sized as small or as large as they need, consists of the following environments within the same Azure subscription, as depicted in the picture below: 232 | 233 |

234 | 235 | 236 |

237 | 238 | 239 | *Figure 8: Network Topology* 240 | 241 | 242 | As the diagram depicts, the business application subscription where Azure Databricks will be deployed, has two VNets, one that is routable to on-premises and the rest of the Azure environment (this can be a small VNet such as /26), and includes the following Azure data resources: Azure Data Factory and ADLS Gen2 (via Private Endpoint). 243 | > ***Note: While we use Azure Data Factory on this implementation, any other service that can perform similar functionality could be used.*** 244 | 245 | 246 | 247 | The other VNet is fully disconnected and is not routable to the rest of the environment, and on this VNet Databricks and optionally Azure Bastion (to be able to perform management via jumpboxes) is deployed, as well as a Private Endpoint to the ADLS Gen2 storage, so that Databricks can retrieve data for ingestion. This setup is described in further details below: 248 | 249 | 250 | 251 | **Connected (routable environment)** 252 | * In a business application subscription, deploy a VNet with RFC1918 addresses which is fully routable in Azure and cross-premises via ExpressRoute. This VNet can be a small VNet, such as /26 or /27. 253 | * This VNet, is connected to a central hub VNet via VNet peering to have connectivity across Azure and on-premises via ExpressRoute or VPN. 254 | * UDR with default route (0.0.0.0/0) points to a central NVA (for example, Azure Firewall) for internet outbound traffic. 255 | * NSGs are configured to block inbound traffic from the internet. 256 | * Azure Data Lake (ADLS) Gen2 is deployed in the business application subscription. 257 | * A Private Endpoint is created on the VNet to make ADLS Gen 2 storage accessible from on-premises and from Azure VNets via a private IP address. 258 | * Azure Data Factory will be responsible for the process of moving data from the source locations (other spoke VNets or on-premises) into the ADLS Gen2 store (accessible via Private Endpoint). 259 | * Azure Data Factory (ADF) is deployed on this routable VNet 260 | * Azure Data Factory components require a compute infrastructure to run on and this is referred to as Integration Runtime. In the mentioned scenario, moving data from on-premises data sources to Azure Data Services (accessible via Private Endpoint), it is required a Self-Hosted Integration Runtime. 261 | * The Self-Hosted Integration Runtime needs to be installed on an Azure Virtual Machine inside the routable VNET in order to allow Azure Data Factory to communicate with the source data and destination data. 262 | * Considering this, Azure Data Factory only requires 1 IP address (and maximum up to 4 IP addresses) in the VNet (via the integration runtime). 263 | 264 | 265 | 266 | **Disconnected (non-routable environment)** 267 | * In the same business application subscription, deploy a VNet with any RFC1918 address space that is desired by the application team (for example, 10.0.0.0/16) 268 | * This VNet is not going to be connected to the rest of the environment. In other words, this will be a disconnected and fully isolated VNet. 269 | * This VNet includes 3 required and 3 optional subnets: 270 | * 2x of them dedicated exclusively to the Azure Databricks Workspace (private-subnet and public-subnet) 271 | * 1x which will be used for the private link to the ADLS Gen2 272 | * (Optional) 1x for Azure Bastion 273 | * (Optional) 1x for jumpboxes 274 | * (Optional but recommended) 1x for Azure Firewall (or other network security NVA). 
275 | * Azure Databricks is deployed on this disconnected VNet. 276 | * Azure Bastion is deployed on this disconnected VNet, to allow Azure Databricks administration via jumpboxes. 277 | * Azure Firewall (or another network security NVA) is deployed on this disconnected VNet to secure internet outbound traffic. 278 | * NSGs are used to lockdown traffic across subnets. 279 | * 2x Private Endpoints are created on this disconnected VNet to make the ADLS Gen2 storage accessible for the Databricks cluster: 280 | * 1x private endpoint having the target sub-resource *blob* 281 | * 1x private endpoint having the target sub-resource *dfs* 282 | * Databricks integrates with ADLS Gen2 storage for data ingestion 283 | 284 | 285 | 286 | ## Do not Store any Production Data in Default DBFS Folders 287 | *Impact: High* 288 | 289 | This recommendation is driven by security and data availability concerns. Every Workspace comes with a default DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You should not store any production data in it, because: 290 | 1. The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contents. 291 | 2. One can't restrict access to this default folder and its contents. 292 | 293 | > ***This recommendation doesn't apply to Blob or ADLS folders explicitly mounted as DBFS by the end user*** 294 | 295 | **More Information:** 296 | [Databricks File System](https://docs.azuredatabricks.net/user-guide/dbfs-databricks-file-system.html) 297 | 298 | 299 | ## Always Hide Secrets in a Key Vault 300 | *Impact: High* 301 | 302 | It is a significant security risk to expose sensitive data such as access credentials openly in Notebooks or other places such as job configs, init scripts, etc. You should always use a vault to securely store and access them. 303 | You can either use ADB’s internal Key Vault for this purpose or use Azure’s Key Vault (AKV) service. 304 | 305 | If using Azure Key Vault, create separate AKV-backed secret scopes and corresponding AKVs to store credentials pertaining to different data stores. This will help prevent users from accessing credentials that they might not have access to. Since access controls are applicable to the entire secret scope, users with access to the scope will see all secrets for the AKV associated with that scope. 306 | 307 | **More Information:** 308 | 309 | [Create an Azure Key Vault-backed secret scope](https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html) 310 | 311 | [Example of using secret in a notebook](https://docs.azuredatabricks.net/user-guide/secrets/example-secret-workflow.html) 312 | 313 | [Best practices for creating secret scopes](https://docs.azuredatabricks.net/user-guide/secrets/secret-acl.html) 314 | 315 | # Deploying Applications on ADB: Guidelines for Selecting, Sizing, and Optimizing Clusters Performance 316 | 317 | > ***"Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure."*** 318 | Mead Conway 319 | 320 | 321 | After understanding how to provision the workspaces, best practices in networking, etc., let’s put on the developer’s hat and see the design choices typically faced by them: 322 | 323 | * What type of clusters should I use? 324 | * How many drivers and how many workers? 325 | * Which Azure VMs should I select? 
326 | 327 | In this chapter we will address such concerns and provide our recommendations, while also explaining the internals of Databricks clusters and associated topics. Some of these ideas seem counterintuitive but they will all make sense if you keep these important design attributes of the ADB service in mind: 328 | 329 | 1. **Cloud Optimized:** Azure Databricks is a product built exclusively for cloud environments, like Azure. No on-prem deployments currently exist. It assumes certain features are provided by the Cloud, is designed keeping Cloud best practices, and conversely, provides Cloud-friendly features. 330 | 2. **Platform/Software as a Service Abstraction:** ADB sits somewhere between the PaaS and SaaS ends of the spectrum, depending on how you use it. In either case ADB is designed to hide infrastructure details as much as possible so the user can focus on application development. It is 331 | not, for example, an IaaS offering exposing the guts of the OS Kernel to you. 332 | 3. **Managed Service:** ADB guarantees a 99.95% uptime SLA. There’s a large team of dedicated staff members who monitor various aspects of its health and get alerted when something goes wrong. It is run like an always-on website and Microsoft and Databricks system operations team strives to minimize any downtime. 333 | 334 | These three attributes make ADB very different than other Spark platforms such as HDP, CDH, Mesos, etc. which are designed for on-prem datacenters and allow the user complete control over the hardware. The concept of a cluster is therefore pretty unique in Azure Databricks. Unlike YARN or Mesos clusters which are just a collection of worker machines waiting for an application to be scheduled on them, clusters in ADB come with a pre-configured Spark application. ADB submits all subsequent user requests 335 | like notebook commands, SQL queries, Java jar jobs, etc. to this primordial app for execution. 336 | 337 | Under the covers Databricks clusters use the lightweight Spark Standalone resource allocator. 338 | 339 | 340 | When it comes to taxonomy, ADB clusters are divided along the notions of “type”, and “mode.” There are two ***types*** of ADB clusters, according to how they are created. Clusters created using UI and [Clusters API](https://docs.azuredatabricks.net/api/latest/clusters.html) are called Interactive Clusters, whereas those created using [Jobs API](https://docs.azuredatabricks.net/api/latest/jobs.html) are called Jobs Clusters. Further, each cluster can be of two ***modes***: Standard and High Concurrency. Regardless of types or mode, all clusters in Azure Databricks can automatically scale to match the workload, using a feature known as [Autoscaling](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling). 341 | 342 | 343 | *Table 2: Cluster modes and their characteristics* 344 | 345 | ![Table 2: Cluster modes and their characteristics](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Table2.PNG "Table 2: Cluster modes and their characteristics") 346 | 347 | ## Support Interactive Analytics Using Shared High Concurrency Clusters 348 | *Impact: Medium* 349 | 350 | There are three steps for supporting Interactive workloads on ADB: 351 | 1. Deploy a shared cluster instead of letting each user create their own cluster. 352 | 2. Create the shared cluster in High Concurrency mode instead of Standard mode. 353 | 3. 
Configure security on the shared High Concurrency cluster, using **one** of the following options: 354 | * Turn on [AAD Credential Passthrough](https://docs.azuredatabricks.net/administration-guide/cloud-configurations/azure/credential-passthrough.html#enabling-azure-ad-credential-passthrough-to-adls) if you’re using ADLS 355 | * Turn on Table Access Control for all other stores 356 | 357 | To understand why, let’s quickly see how interactive workloads are different from batch workloads: 358 | 359 | *Table 3: Batch vs. Interactive workloads* 360 | ![Table 3: Batch vs. Interactive workloads](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Table3.PNG "Table 3: Batch vs. Interactive workloads") 361 | 362 | Because of these differences, supporting Interactive workloads entails minimizing cost variability and optimizing for latency over throughput, while providing a secure environment. These goals are satisfied by shared High Concurrency clusters with Table Access Controls or AAD Credential Passthrough (in case of ADLS) turned on: 363 | 364 | 1. **Minimizing Cost:** By forcing users to share an autoscaling cluster you have configured with a maximum node count, rather than, say, asking them to create a new one each time they log in, you can control the total cost easily. The maximum cost of the shared cluster can be calculated by assuming it runs X hours at maximum size with the particular VMs. It is difficult to achieve this if each user is given free rein to create clusters of arbitrary size and VMs. 365 | 366 | 2. **Optimizing for Latency:** Only High Concurrency clusters have features which allow queries from different users to share cluster resources in a fair, secure manner. HC clusters come with Query Watchdog, a process which keeps disruptive queries in check by automatically pre-empting rogue queries, limiting the maximum number of output rows returned, etc. 367 | 368 | 3. **Security:** The Table Access Control feature is only available in High Concurrency mode and needs to be turned on so that users can limit access to their database objects (tables, views, functions, etc.) created on the shared cluster. In case of ADLS, we recommend restricting access using the AAD Credential Passthrough feature instead of Table Access Controls. 369 | 370 | > ***If you’re using ADLS, we recommend AAD Credential Passthrough instead of Table Access Control for easy manageability.*** 371 | 372 | ![Figure 5: Interactive clusters](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Figure5.PNG "Figure 5: Interactive clusters") 373 | 374 | *Figure 5: Interactive clusters* 375 | 376 | ## Support Batch ETL Workloads With Single User Ephemeral Standard Clusters 377 | *Impact: Medium* 378 | 379 | Unlike Interactive workloads, logic in batch Jobs is well defined and their cluster resource requirements are known *a priori*. Hence, to minimize cost, there’s no reason to follow the shared cluster model and we 380 | recommend letting each job create a separate cluster for its execution. Thus, instead of submitting batch ETL jobs to a cluster already created from ADB’s UI, submit them using the Jobs APIs. These APIs automatically create new clusters to run Jobs and terminate them once the run completes. We call this the **Ephemeral Job Cluster** pattern for running jobs because the cluster’s short life is tied to the job lifecycle. 
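As a minimal sketch of the Ephemeral Job Cluster pattern, the snippet below submits a notebook run through the Runs-Submit Jobs API with a `new_cluster` definition, so the cluster only lives for the duration of that run; the workspace URL, token handling, notebook path, VM size, and runtime version are placeholders.

```python
import os
import requests

# Placeholders: workspace URL, token, notebook path, and cluster spec are examples only.
WORKSPACE_URL = "https://adb-<workspace-id>.azuredatabricks.net"
TOKEN = os.environ["DATABRICKS_TOKEN"]

payload = {
    "run_name": "nightly-etl",
    "new_cluster": {                      # ephemeral cluster, created and terminated per run
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    },
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # contains the run_id, which can be polled via /api/2.0/jobs/runs/get
```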
381 | 382 | Azure Data Factory uses this pattern as well - each job ends up creating a separate cluster since the underlying call is made using the [Runs-Submit Jobs API](https://docs.azuredatabricks.net/api/latest/jobs.html#runs-submit). 383 | 384 | 385 | ![Figure 6: Ephemeral Job cluster](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Figure6.PNG "Figure 6: Ephemeral Job cluster") 386 | 387 | *Figure 6: Ephemeral Job cluster* 388 | 389 | Just like the previous recommendation, this pattern will achieve general goals of minimizing cost, improving the target metric (throughput), and enhancing security by: 390 | 391 | 1. **Enhanced Security:** ephemeral clusters run only one job at a time, so each executor’s JVM runs code from only one user. This makes ephemeral clusters more secure than shared clusters for Java and Scala code. 392 | 2. **Lower Cost:** if you run jobs on a cluster created from ADB’s UI, you will be charged at the higher Interactive DBU rate. The lower Data Engineering DBUs are only available when the lifecycle of job and cluster are same. This is only achievable using the Jobs APIs to launch jobs on ephemeral clusters. 393 | 3. **Better Throughput:** cluster’s resources are dedicated to one job only, making the job finish faster than while running in a shared environment. 394 | 395 | For very short duration jobs (< 10 min) the cluster launch time (~ 7 min) adds a significant overhead to total execution time. Historically this forced users to run short jobs on existing clusters created by UI -- a 396 | costlier and less secure alternative. To fix this, ADB is coming out with a new feature called Instance Pools in Q3 2019 bringing down cluster launch time to 30 seconds or less. 397 | 398 | ## Favor Cluster Scoped Init Scripts over Global and Named scripts 399 | *Impact: High* 400 | 401 | [Init Scripts](https://docs.azuredatabricks.net/user-guide/clusters/init-scripts.html) provide a way to configure cluster’s nodes and can be used in the following modes: 402 | 403 | 1. **Global:** by placing the init script in `/databricks/init` folder, you force the script’s execution every time any cluster is created or restarted by users of the workspace. 404 | 2. **Cluster Named (deprecated):** you can limit the init script to run only on for a specific cluster’s creation and restarts by placing it in `/databricks/init/` folder. 405 | 3. **Cluster Scoped:** in this mode the init script is not tied to any cluster by its name and its automatic execution is not a virtue of its dbfs location. Rather, you specify the script in cluster’s configuration by either writing it directly in the cluster configuration UI or storing it on DBFS and specifying the path in [Cluster Create API](https://docs.azuredatabricks.net/user-guide/clusters/init-scripts.html#cluster-scoped-init-script). Any location under DBFS `/databricks` folder except `/databricks/init` can be used for this purpose, such as: `/databricks//set-env-var.sh` 406 | 407 | You should treat Init scripts with *extreme* caution because they can easily lead to intractable cluster launch failures. If you really need them, please use the Cluster Scoped execution mode as much as possible because: 408 | 409 | 1. ADB executes the script’s body in each cluster node. Thus, a successful cluster launch and subsequent operation is predicated on all nodal init scripts executing in a timely manner without any errors and reporting a zero exit code. 
This process is highly error prone, especially for scripts downloading artifacts from an external service over unreliable and/or misconfigured networks. 410 | 2. Because Global and Cluster Named init scripts execute automatically due to their placement in a special DBFS location, it is easy to overlook that they could be causing a cluster to not launch. By specifying the Init script in the Configuration, there’s a higher chance that you’ll consider them while debugging launch failures. 411 | 412 | ## Use Cluster Log Delivery Feature to Manage Logs 413 | *Impact: Medium* 414 | 415 | By default, Cluster logs are sent to the default DBFS, but you should consider sending the logs to a blob store location under your control using the [Cluster Log Delivery](https://docs.azuredatabricks.net/user-guide/clusters/log-delivery.html#cluster-log-delivery) feature. The Cluster Logs contain logs emitted by user code, as well as the Spark framework’s Driver and Executor logs. Sending them to a blob store controlled by yourself is recommended over the default DBFS location because: 416 | 1. ADB’s automatic 30-day default DBFS log purging policy might be too short for certain compliance scenarios. A blob store location in your subscription will be free from such policies. 417 | 2. You can ship logs to other tools only if they are present in your storage account and a resource group governed by you. The root DBFS, although present in your subscription, is launched inside a Microsoft Azure managed resource group and is protected by a read lock. Because of this lock the logs are only accessible by privileged Azure Databricks framework code. However, constructing a pipeline to ship the logs to downstream log analytics tools requires logs to be in a lock-free location first. 418 | 419 | ## Choose Cluster VMs to Match Workload Class 420 | *Impact: High* 421 | 422 | To allocate the right amount and type of cluster resource for a job, we need to understand how different types of jobs demand different types of cluster resources. 423 | 424 | * **Machine Learning** - To train machine learning models you usually need to cache all of the data in memory. Consider using memory optimized VMs so that the cluster can take advantage of the RAM cache. You can also use storage optimized instances for very large datasets. To size the cluster, take a % of the data set → cache it → see how much memory it 425 | used → extrapolate that to the rest of the data. 426 | 427 | * **Streaming** - You need to make sure that the processing rate is just above the input rate at peak times of the day. Depending on peak input rate times, consider compute optimized VMs for the cluster to make sure the processing rate is higher than your input rate. 428 | 429 | * **ETL** - In this case, data size and how fast the job needs to complete are the leading indicators. Spark doesn’t always require data to be loaded into memory in order to execute transformations, but you’ll at the very least need to see how large the task sizes are on shuffles and compare that to the task throughput you’d like. To analyze the performance of these jobs, start with the basics and check whether the job is bound by CPU, network, or local I/O, and go from there. Consider using a general purpose VM for these jobs. 430 | * **Interactive / Development Workloads** - The ability for a cluster to auto scale is most important for these types of jobs. 
In this case, taking advantage of the [Autoscaling feature](https://docs.azuredatabricks.net/user-guide/clusters/sizing.html#cluster-size-and-autoscaling) will be your best friend in managing the cost of the infrastructure. 431 | 432 | ## Arrive at Correct Cluster Size by Iterative Performance Testing 433 | *Impact: High* 434 | 435 | It is impossible to predict the correct cluster size without developing the application because Spark and Azure Databricks use numerous techniques to improve cluster utilization. The broad approach you should follow for sizing is: 436 | 437 | 1. Develop on a medium-sized cluster of 2-8 nodes, with VMs matched to workload class as explained earlier. 438 | 2. After meeting functional requirements, run an end-to-end test on larger representative data while measuring CPU, memory and I/O used by the cluster at an aggregate level. 439 | 3. Optimize the cluster to remove the bottlenecks found in step 2: 440 | - **CPU bound**: add more cores by adding more nodes 441 | - **Network bound**: use fewer, bigger SSD-backed machines to reduce network traffic and improve remote read performance 442 | - **Disk I/O bound**: if jobs are spilling to disk, use VMs with more memory. 443 | 444 | Repeat steps 2 and 3 by adding nodes and/or evaluating different VMs until all obvious bottlenecks have been addressed. 445 | 446 | Performing these steps will help you arrive at a baseline cluster size which can meet the SLA on a subset of data. In theory, Spark jobs, like jobs on other data-intensive frameworks (Hadoop), exhibit linear scaling. For example, if it takes 5 nodes to meet the SLA on a 100TB dataset, and the production data is around 1PB, then the production cluster is likely to be around 50 nodes in size. You can use this back-of-the-envelope calculation as a first guess for capacity planning. However, there are scenarios where Spark jobs don’t scale linearly. In some cases this is due to large amounts of shuffle adding an exponential synchronization cost (explained next), but there could be other reasons as well. Hence, to refine the first estimate and arrive at a more accurate node count we recommend repeating this process 3-4 times on increasingly larger data set sizes, say 5%, 10%, 15%, 30%, etc. The overall accuracy of the process depends on how closely the test data matches the live workload both in type and size. 447 | 448 | ## Tune Shuffle for Optimal Performance 449 | *Impact: High* 450 | 451 | A shuffle occurs when we need to move data from one node to another in order to complete a stage. Depending on the type of transformation you are doing, you may cause a shuffle to occur. This happens when the executors need to see all of the data in order to accurately perform the operation. If the job requires a wide transformation, you can expect it to run more slowly because all of the partitions need to be shuffled around in order to complete the job, e.g. group by and distinct. 452 | 453 |
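As a concrete illustration of the tuning knobs discussed after the figure below, here is a minimal PySpark sketch that adjusts the shuffle partition count for a wide transformation; the value 400 and the paths are placeholders, to be tuned so that each shuffle partition lands in the recommended megabytes-to-1 GB range.

```python
# Minimal sketch: tune the number of shuffle partitions before a wide transformation.
# The partition count and paths are placeholders; tune them against data volume and core count.
spark.conf.set("spark.sql.shuffle.partitions", 400)

orders = spark.read.parquet("/mnt/raw/orders")      # hypothetical mounted input path
daily_totals = (
    orders.groupBy("order_date")                    # wide transformation -> triggers a shuffle
          .sum("amount")
)
daily_totals.write.mode("overwrite").parquet("/mnt/curated/daily_totals")
```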

454 | 455 |

456 | 457 | *Figure 7: Shuffle vs. no-shuffle* 458 | 459 | 460 | You’ve got two control knobs of a shuffle you can use to optimize 461 | * The number of partitions being shuffled: 462 | ![SparkSnippet](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/SparkSnippet.PNG "SparkSnippet") 463 | * The amount of partitions that you can compute in parallel. 464 | + This is equal to the number of cores in a cluster. 465 | 466 | These two determine the partition size, which we recommend should be in the Megabytes to 1 Gigabyte range. If your shuffle partitions are too small, you may be unnecessarily adding more tasks to the stage. But if they are too big, you may get bottlenecked by the network. 467 | 468 | ## Partition Your Data 469 | *Impact: High* 470 | 471 | This is a broad Big Data best practice not limited to Azure Databricks, and we mention it here because it can notably impact the performance of Databricks jobs. Storing data in partitions allows you to take advantage of partition pruning and data skipping, two very important features which can avoid unnecessary data reads. Most of the time partitions will be 472 | on a date field but you should choose your partitioning field based on the predicates most often used by your queries. For example, if you’re always going to be filtering based on “Region,” then consider partitioning your data by region. 473 | * Evenly distributed data across all partitions (date is the most common) 474 | * 10s of GB per partition (~10 to ~50GB) 475 | * Small data sets should not be partitioned 476 | * Beware of over partitioning 477 | 478 | 479 | # Running ADB Applications Smoothly: Guidelines on Observability and Monitoring 480 | 481 | > ***“Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can.”*** 482 | Jamie Zawinski 483 | 484 | By now we have covered planning for ADB deployments, provisioning Workspaces, selecting clusters, and deploying your applications on them. Now, let's talk about how to to monitor your Azure Databricks apps. These apps are rarely executed in isolation and need to be monitored 485 | along with a set of other services. Monitoring falls into four broad areas: 486 | 487 | 1. Resource utilization (CPU/Memory/Network) across an Azure Databricks cluster. This is referred to as VM metrics. 488 | 2. Spark metrics which enables monitoring of Spark applications to help uncover bottlenecks 489 | 3. Spark application logs which enables administrators/developers to query the logs, debug issues and investigate job run failures. This is specifically helpful to also understand exceptions across your workloads. 490 | 4. Application instrumentation which is native instrumentation that you add to your application for custom troubleshooting 491 | 492 | For the purposes of this version of the document we will focus on (1). This is the most common ask from customers. 493 | 494 | ## Collect resource utilization metrics across Azure Databricks cluster in a Log Analytics workspace 495 | *Impact: Medium* 496 | 497 | An important facet of monitoring is understanding the resource utilization in Azure Databricks clusters. You can also extend this to understanding utilization across all clusters in a workspace. This information is useful in arriving at the correct cluster and VM sizes. Each VM does have a set of limits (cores/disk throughput/network throughput) which play an important role in determining the performance profile of an Azure Databricks job. 
498 | In order to get utilization metrics of an Azure Databricks cluster, you can stream the VMs’ metrics to an Azure Log Analytics Workspace (see Appendix A) by installing the Log Analytics Agent on each cluster node. Note: This could increase your cluster 499 | startup time by a few minutes. 500 | 501 | 502 | ### Querying VM metrics in Log Analytics once you have started the collection 503 | 504 | You can use Log Analytics directly to query the Perf data. For example, you can write a query that charts CPU utilization for the VMs belonging to a specific cluster ID. See the Log Analytics overview for further documentation on Log Analytics and query syntax. The snippet below saves an init script to DBFS that installs the Log Analytics (OMS) agent on each cluster node; replace $YOUR_ID and $YOUR_KEY with your workspace ID and key. 505 | 506 | ```python 507 | %python 508 | script = """ 509 | sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d 510 | wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w $YOUR_ID -s $YOUR_KEY -d opinsights.azure.com 511 | """ 512 | 513 | # Save the script to the Databricks file system so it can be loaded by the VMs 514 | dbutils.fs.put("/databricks/log_init_scripts/configure-omsagent.sh", script, True) 515 | ``` 516 | 517 | ![Grafana](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Grafana.PNG "Grafana") 518 | 519 | You can also use Grafana to visualize your data from Log Analytics. 520 | 521 | ## Cost Management, Chargeback and Analysis 522 | 523 | This section focuses on Azure Databricks billing, tools to manage and analyze cost, and how to charge back to teams. 524 | 525 | ### Azure Databricks Billing 526 | First, it is important to understand the different workloads and tiers available with Azure Databricks. Azure Databricks is available in 2 tiers – Standard and Premium. The Premium tier offers additional features on top of what is available in the Standard tier. These include role-based access control for notebooks, jobs, and tables, audit logs, Azure AD credential passthrough, conditional authentication and many more. Please refer to [Azure Databricks pricing](https://azure.microsoft.com/en-us/pricing/details/databricks/) for the complete list. 527 | 528 | Both the Premium and Standard tiers come with 3 types of workloads: 529 | * Jobs Compute (previously called Data Engineering) 530 | * Jobs Light Compute (previously called Data Engineering Light) 531 | * All-purpose Compute (previously called Data Analytics) 532 | Jobs Compute and Jobs Light Compute make it easy for data engineers to build and execute jobs, and All-purpose Compute makes it easy for data scientists to explore, visualize, manipulate, and share data and insights interactively. Depending upon the use case, one can also use All-purpose Compute for data engineering or automated scenarios, especially if the incoming job rate is high. 
533 | 534 | When you create an Azure Databricks workspace and spin up a cluster, below resources are consumed 535 | * DBUs – A DBU is a unit of processing capability, billed on a per-second usage 536 | * Virtual Machines – These represent your Databricks clusters that run the Databricks Runtime 537 | * Public IP Addresses – These represent the IP Addresses consumed by the Virtual Machines when the cluster is running 538 | * Blob Storage – Each workspace comes with a default storage 539 | * Managed Disk 540 | * Bandwidth – Bandwidth charges for any data transfer 541 | 542 | | Service or Resource | Pricing | 543 | | --- | --- | 544 | | DBUs |[DBU pricing](https://azure.microsoft.com/en-us/pricing/details/databricks/) | 545 | | VMs |[VM pricing](https://azure.microsoft.com/en-us/pricing/details/databricks/) | 546 | | Public IP Addresses |[Public IP Addresses pricing](https://azure.microsoft.com/en-us/pricing/details/ip-addresses/) | 547 | | Blob Storage |[Blob Storage pricing](https://azure.microsoft.com/en-us/pricing/details/storage/) | 548 | | Managed Disk |[Managed Disk pricing](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) | 549 | | Bandwidth |[Bandwidth pricing](https://azure.microsoft.com/en-us/pricing/details/bandwidth/) | 550 | 551 | In addition, if you use additional services as part of your end-2-end solution, such as Azure CosmosDB, or Azure Event Hub, then they are charged per their pricing plan. 552 | 553 | There are 2 pricing plans for Azure Databricks DBUs: 554 | 555 | 1. Pay as you go – Pay for the DBUs as you use: Refer to the pricing page for the DBU prices based on the SKU. The DBU per hour price for different SKUs differs across Azure public cloud, Azure Gov and Azure China region. 556 | 557 | 2. Pre-purchase or Reservations – You can get up to 37% savings over pay-as-you-go DBU when you pre-purchase Azure Databricks Units (DBU) as Databricks Commit Units (DBCU) for either 1 or 3 years. A Databricks Commit Unit (DBCU) normalizes usage from Azure Databricks workloads and tiers into to a single purchase. Your DBU usage across those workloads and tiers will draw down from the Databricks Commit Units (DBCU) until they are exhausted, or the purchase term expires. The draw down rate will be equivalent to the price of the DBU, as per the table above. 558 | 559 | Since, you are also billed for the VMs, you have both the above options for VMs as well: 560 | 561 | 1. Pay as you go 562 | 2. [Reservations](https://azure.microsoft.com/en-us/pricing/reserved-vm-instances/) 563 | 564 | Please see few examples of a billing for Azure Databricks with Pay as you go: 565 | 566 | Depending on the type of workload your cluster runs, you will either be charged for Jobs Compute, Jobs Light Compute, or All-purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you will be charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you will be billed for All-purpose Compute workload. 567 | 568 | Accordingly, the pricing will be dependent on below components 569 | 1. DBU SKU – DBU price based on the workload and tier 570 | 2. VM SKU – VM price based on the VM SKU 571 | 3. DBU Count – Each VM SKU has an associated DBU count. Example – D3v2 has DBU count of 0.75 572 | 4. Region 573 | 5. 
Duration 574 | 575 | #### Example 1: If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for All-purpose Compute: 576 | * VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 577 | * DBU cost for All-purpose Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.55/DBU = $1,100 578 | * The total cost would therefore be $598 (VM Cost) + $1,100 (DBU Cost) = $1,698. 579 | 580 | #### Example 2: If you run Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 instances, the billing would be the following for Jobs Compute workload: 581 | * VM cost for 10 DS13v2 instances —100 hours x 10 instances x $0.598/hour = $598 582 | * DBU cost for Jobs Compute workload for 10 DS13v2 instances —100 hours x 10 instances x 2 DBU per node x $0.30/DBU = $600 583 | * The total cost would therefore be $598 (VM Cost) + $600 (DBU Cost) = $1,198. 584 | 585 | In addition to VM and DBU charges, there will be additional charges for managed disks, public IP address, bandwidth, or any other resource such as Azure Storage, Azure Cosmos DB depending on your application. 586 | 587 | #### Azure Databricks Trial 588 | If you are new to Azure Databricks, you can also use a Trial SKU that gives you free DBUs for Premium tier for 14 days. You will still need to pay for other resources like VM, Storage etc. that are consumed during this period. After the trial is over, you will need to start paying for the DBUs. 589 | 590 | ### Chargeback Scenarios 591 | 592 | There are 2 broad scenarios we have seen with respect to chargeback internal teams for sharing Databricks resources 593 | 1. Chargeback across a single Azure Databricks workspace: In this case, a single workspace is shared across multiple teams and user would like to chargeback the individual teams. Individual teams would use their own Databricks cluster and can be charged back at cluster level. 594 | 595 | 2. Chargeback across multiple Databricks workspace: In this case, teams use their own workspace and would like to chargeback at workspace level. 596 | To support these scenarios, Azure Databricks leverages Azure Tags so that the users can view the cost/usage for resources with tags. There are default tags that comes with the. 597 | 598 | Please see below the default tags that are available with the resources: 599 | | Resources | Default Tags | 600 | | --- | --- | 601 | | All-purpose Compute |Vendor, Creator, ClusterName, ClusterId| 602 | | Jobs Compute or Jobs Light Compute |Vendor, Creator, ClusterName, ClusterId, RunName, JobId | 603 | | Pool |Vendor, DatabricksInstancePoolId,DatabricksInstancePoolCreatorId | 604 | | Resources created during workspace creation (Storage, Worker VNet, NSG) |application, databricks-environment | 605 | 606 | In addition to the default tags, customers can add custom tags to the resources based on how they want to charge back. Both default and custom tags are displayed on Azure bills that allows one to chargeback by filtering resource usage based on tags. 607 | 608 | 1. [Cluster Tags](https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-tags): You can create custom tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to underlying cluster resources – VMs, DBUs, Public IP Addresses, Disks. 609 | 610 | 2. 
[Pool Tags](https://docs.microsoft.com/en-us/azure/databricks/clusters/instance-pools/configure#--pool-tags): You can create custom tags as key-value pairs when you create a pool, and Azure Databricks applies these tags to underlying pool resources – VMs, Public IP Addresses, Disks. Pool-backed clusters inherit default and custom tags from the pool configuration. 611 | 612 | 3. [Workspace Tags](https://docs.microsoft.com/en-us/azure/databricks/administration-guide/account-settings/usage-detail-tags-azure): You can create custom tags as key-value pairs when you create an Azure Databricks workspace. These tags apply to underlying resources within the workspace – VMs, DBUs, and others. 613 | 614 | Please see below how tags propagate for DBUs and VMs: 615 | 616 | 1. Clusters created from pools 617 | * DBU Tag = Workspace Tag + Pool Tag + Cluster Tag 618 | * VM Tag = Workspace Tag + Pool Tag 619 | 620 | 2. Clusters not created from pools 621 | * DBU Tag = Workspace Tag + Cluster Tag 622 | * VM Tag = Workspace Tag + Cluster Tag 623 | 624 | These tags (default and custom) propagate to [Cost Analysis Reports](https://docs.microsoft.com/en-us/azure/cost-management-billing/costs/quick-acm-cost-analysis) that you can access in the Azure Portal. The section below explains how to do cost/usage analysis using these tags. 625 | 626 | ### Cost/Usage Analysis 627 | The Cost Analysis report is available under Cost Management within the Azure Portal. Please refer to the [Cost Management](https://docs.microsoft.com/en-us/azure/cost-management-billing/costs/quick-acm-cost-analysis) section to get a detailed overview of how to use Cost Management. 628 | 629 | ![Cost Management](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management.png "Cost Management") 630 | 631 | The example below is aimed at giving you a quick start with cost analysis for Azure Databricks. The steps are: 632 | 633 | 1. In the Azure Portal, click on Cost Management + Billing 634 | 2. In Cost Management, click on the Cost Analysis tab 635 | 636 | ![Cost Management config](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20config.png "Cost Management config") 637 | 638 | 3. Choose the billing scope that you want the report for and make sure the user has the Cost Management Reader permission for that scope. 639 | 4. Once selected, you will see cost reports for all the Azure resources at that scope. 640 | 5. After that, you can create different reports by using the different options on the chart. For example, one of the reports you can create uses: 641 | 642 | * Chart option as Column (stacked) 643 | * Granularity – Daily 644 | * Group by – Tag – choose ClusterName or ClusterId 645 | 646 | You will see something like the report below, showing the daily distribution of cost for the different clusters in your subscription or the scope that you chose in Step 3. You also have the option to save this report and share it with your team. 647 | 648 | ![Cost Management config](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20Report.png "Cost Management report") 649 | 650 | To charge back, you can filter this report using the tag option. For example, you can use the default tag Creator, or your own custom tag such as Cost Center, and charge back based on that. 
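For the cluster-level chargeback scenario described earlier, custom tags can be attached when the cluster is created so that they flow into these cost reports alongside the default ClusterName and ClusterId tags. Below is a minimal sketch using the Clusters API; the workspace URL, token handling, and the CostCenter/Team tag names are illustrative only.

```python
import os
import requests

# Placeholders: workspace URL, token, and tag names are examples only.
WORKSPACE_URL = "https://adb-<workspace-id>.azuredatabricks.net"
TOKEN = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_name": "finance-analytics",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Custom tags propagate to the cluster's VMs, DBUs, disks, and public IP addresses,
    # and show up in Cost Analysis for filtering and chargeback.
    "custom_tags": {"CostCenter": "FIN-001", "Team": "finance-analytics"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```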
651 | 652 | ![Cost Management filter](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20filter.png "Cost Management filter") 653 | 654 | You also have option to consume this data from CSV or a native Power BI connector for Cost Management. Please see below: 655 | 656 | 1. To download this data to CSV, you can set export from Cost Management + Billing -> Usage + Charges and choose Usage Details Version 2 on the right. Refer [this](https://docs.microsoft.com/en-us/azure/cost-management-billing/reservations/understand-reserved-instance-usage-ea#download-the-usage-csv-file-with-new-data) for more details. Once downloaded, you can view the cost usage data and filter based on tags to chargeback. In the CSV, you can refer the Meter Name to get the Databricks workload consumed. In addition, this is how the other fields are represented for meters related to Azure Databricks. 657 | 658 | * Quantity = Number of Virtual Machines x Number of hours x DBU count 659 | * Effective Price = DBU price based on the SKU 660 | * Cost = Quantity x Effective Price 661 | 662 | ![Cost Management export](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20export.png "Cost Management export") 663 | 664 | 2. There is a native [Cost Management Connector](https://docs.microsoft.com/en-us/power-bi/connect-data/desktop-connect-azure-cost-management) in Power BI that allows one to make powerful, customized visualization and cost/usage reports. 665 | 666 | ![Cost Management connector](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20connector.png "Cost Management connector") 667 | 668 | Once you connect, you can create various rich reports easily like below by choosing the right fields from the table. 669 | 670 | Tip: To filter on tags, you will need to parse the json in Power BI. To do that, follow these steps: 671 | 672 | 1. Go to "Query Editor" 673 | 2. Select the "Usage Details" table 674 | 3. On the right side the "Properties" tab shows the steps as 675 | 676 | ![Cost Management config](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20PBI%20config.png "Cost Management config") 677 | 678 | 4. From the menu bar go to "Add column" -> "Add custom column" 679 | 5. Name the column and enter the following text in the query = "{"& [Tags] & "}" 680 | 681 | ![Cost Management config2](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20PBI%20config2.jpg "Cost Management config2") 682 | 683 | 6. This will create a new column of "tags" in the json format. 684 | 7. Now user can transform it as expand it. You can then use the different tags as columns that you can use in a report. 685 | 686 | Please see some of the common views created easily using this connector. 
687 | 688 | * Cost Report breakdown by Resource Group, Tags, MeterName 689 | ![Cost Management report1](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20PBI%20report%201.png "Cost Management report1") 690 | 691 | * Cost Report breakdown by Cluster and custom tags 692 | ![Cost Management report2](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20PBI%20report2.PNG "Cost Management report2") 693 | 694 | * Cost Report breakdown by Cluster and MeterName as a pie chart 695 | ![Cost Management report3](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20PBI%20report3.PNG "Cost Management report3") 696 | 697 | * Cost Report breakdown by Resource Group and Cluster, including quantity 698 | ![Cost Management report4](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Cost%20Management%20PBI%20report4.PNG "Cost Management report4") 699 | 700 | ### Pricing difference for Regions 701 | 702 | Please refer to the [Azure Databricks pricing page](https://azure.microsoft.com/en-us/pricing/details/databricks/) to get the pricing for each DBU SKU and the pricing discount based on Reservations. There are certain differences to consider: 703 | 1. The DBU prices are different for the Azure public cloud and other regions such as Azure Gov 704 | 2. The pre-purchase plan prices are different for the Azure public cloud and Azure Gov 705 | 706 | ### Known Issues/Limitations 707 | 708 | 1. Tag change propagation at the workspace level takes up to ~1 hour to apply to resources under the Managed resource group. 709 | 2. Tag change propagation at the workspace level requires a cluster restart for existing running clusters, or pool expansion 710 | 3. Cost Management at the parent resource group won’t show Managed RG resource consumption 711 | 4. Cost Management role assignments are not possible at the Managed RG level. Today the user must have a role assignment at the parent resource group level or above (i.e. subscription) to see managed RG consumption 712 | 5. For clusters created from a pool, only workspace tags and pool tags are propagated to the VMs 713 | 6. Tag keys and values can contain only characters from the ISO 8859-1 set 714 | 7. A custom tag gets prefixed with x_ when it conflicts with a default tag 715 | 8. A maximum of 50 tags can be assigned to an Azure resource 716 | 717 | # Appendix A 718 | 719 | ## Installation steps to capture VM metrics in Log Analytics 720 | 721 | 722 | 723 | #### Step 1 - Create a Log Analytics Workspace 724 | Please follow the instructions [here](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#create-a-workspace) to create a Log Analytics workspace. 725 | 726 | #### Step 2 - Get Log Analytics Workspace Credentials 727 | Get the workspace id and key using the instructions [here.](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#obtain-workspace-id-and-key) 728 | 729 | Store these in an Azure Key Vault-backed secret scope. 730 | 731 | #### Step 3 - Configure Data Collection in Log Analytics Workspace 732 | Please follow the instructions [here.](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#collect-event-and-performance-data) 733 | 734 | #### Step 4 - Configure the Init Script 735 | Replace the *LOG_ANALYTICS_WORKSPACE_ID* and *LOG_ANALYTICS_WORKSPACE_KEY* with your own info. 
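For convenience, here is a copyable sketch along the lines of the snippet shown in the screenshot below, following the same onboarding approach as the earlier example in this document; the secret scope and key names used to hold the workspace ID and key are hypothetical.

```python
%python
# Sketch of building a cluster-scoped init script that onboards the Log Analytics (OMS) agent.
# The secret scope/key names are placeholders; store the workspace ID and key securely (Step 2).
workspace_id = dbutils.secrets.get("monitoring-scope", "log-analytics-workspace-id")
workspace_key = dbutils.secrets.get("monitoring-scope", "log-analytics-workspace-key")

script = f"""
sed -i "s/^exit 101$/exit 0/" /usr/sbin/policy-rc.d
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w {workspace_id} -s {workspace_key} -d opinsights.azure.com
"""

# Save the script to DBFS and reference this path as a cluster-scoped init script.
dbutils.fs.put("/databricks/log_init_scripts/configure-omsagent.sh", script, True)
```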
736 | 737 | ![PythonSnippet](https://github.com/Azure/AzureDatabricksBestPractices/blob/master/Python%20Snippet.PNG "PythonSnippet") 738 | 739 | Now it could be used as a global script with all clusters (change the path to /databricks/init in that case), or as a cluster-scoped script with specific ones. We recommend using cluster scoped scripts as explained in this doc earlier. 740 | 741 | #### Step 5 - View Collected Data via Azure Portal 742 | See [this](https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer#view-data-collected) document. 743 | 744 | #### References 745 | * https://docs.microsoft.com/en-us/azure/azure-monitor/learn/quick-collect-linux-computer 746 | * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/OMS-Agent-for-Linux.md 747 | * https://github.com/Microsoft/OMS-Agent-for-Linux/blob/master/docs/Troubleshooting.md 748 | 749 | ## Access patterns with Azure Data Lake Storage Gen2 750 | To understand the various access patterns and approaches to securing data in ADLS see the [following guidance](https://github.com/hurtn/datalake-ADLS-access-patterns-with-Databricks/blob/master/readme.md). 751 | 752 | --------------------------------------------------------------------------------