# Mazur's SQL Style Guide

Howdy! I'm [Matt Mazur](https://mattmazur.com/) and I'm a data analyst who has worked at several startups to help them use data to grow their businesses. This guide is an attempt to document my preferences for formatting SQL in the hope that it may be of some use to others. If you or your team do not already have a SQL style guide, this may serve as a good starting point which you can adopt and update based on your preferences.

Also, I'm a strong believer in having [Strong Opinions, Weakly Held](https://medium.com/@ameet/strong-opinions-weakly-held-a-framework-for-thinking-6530d417e364) so if you disagree with any of this, [drop me a note](https://mattmazur.com/contact/). I'd love to discuss it.

You may also enjoy my [LookML Style Guide](https://github.com/mattm/lookml-style-guide) or my [blog](https://mattmazur.com/category/analytics/) where I write about analytics and data analysis.

Also, if you're interested in continuing to level up your technical skills, **check out [Emergent Mind](https://www.emergentmind.com/?utm_source=sqlstyle)**, an AI research assistant I'm working on that helps you discover and learn about cutting-edge computer science research.

Simplified Chinese version here: [中文版](https://github.com/huangxinping/sql-style-guide/blob/zh-cn/README.md)

## Example

Here's a non-trivial query to give you an idea of what this style guide looks like in practice:

```sql
with hubspot_interest as (

  select
    email,
    timestamp_millis(property_beacon_interest) as expressed_interest_at
  from hubspot.contact
  where
    property_beacon_interest is not null

),

support_interest as (

  select
    conversation.email,
    conversation.created_at as expressed_interest_at
  from helpscout.conversation
  inner join helpscout.conversation_tag on conversation.id = conversation_tag.conversation_id
  where
    conversation_tag.tag = 'beacon-interest'

),

combined_interest as (

  select * from hubspot_interest
  union all
  select * from support_interest

),

first_interest as (

  select
    email,
    min(expressed_interest_at) as expressed_interest_at
  from combined_interest
  group by email

)

select * from first_interest
```

## Guidelines

### Use lowercase SQL

It's just as readable as uppercase SQL and you won't have to constantly be holding down a shift key.

```sql
-- Good
select * from users

-- Bad
SELECT * FROM users

-- Bad
Select * From users
```

### Put each selected column on its own line

When selecting columns, always put each column name on its own line and never on the same line as `select`. For multiple columns, it's easier to read when each column is on its own line. And for single columns, it's easier to add additional columns without any reformatting (which you would have to do if the single column name was on the same line as the `select`).

```sql
-- Good
select
  id
from users

-- Good
select
  id,
  email
from users

-- Bad
select id
from users

-- Bad
select id, email
from users
```

### select *

When selecting `*` it's fine to include the `*` next to the `select` and also fine to include the `from` on the same line, assuming no additional complexity like `where` conditions:

```sql
-- Good
select * from users

-- Good too
select *
from users

-- Bad
select * from users where email = 'name@example.com'
```

### Indenting where conditions

Similarly, conditions should always be spread across multiple lines to maximize readability and make them easier to add to. Operators should be placed at the end of each line:

```sql
-- Good
select *
from users
where
  email = 'example@domain.com'

-- Good
select *
from users
where
  email like '%@domain.com' and
  created_at >= '2021-10-08'

-- Bad
select *
from users
where email = 'example@domain.com'

-- Bad
select *
from users
where
  email like '%@domain.com' and created_at >= '2021-10-08'

-- Bad
select *
from users
where
  email like '%@domain.com'
  and created_at >= '2021-10-08'
```

### Left align SQL keywords

Some IDEs have the ability to automatically format SQL so that the spaces after the SQL keywords are vertically aligned. This is cumbersome to do by hand (and in my opinion harder to read anyway) so I recommend just left aligning all of the keywords:

```sql
-- Good
select
  id,
  email
from users
where
  email like '%@gmail.com'

-- Bad
select id, email
  from users
 where email like '%@gmail.com'
```

### Use single quotes

Some SQL dialects like BigQuery support using double quotes for strings, but in most dialects double quotes refer to identifiers such as column names. For that reason, single quotes are preferable:

```sql
-- Good
select *
from users
where
  email = 'example@domain.com'

-- Bad
select *
from users
where
  email = "example@domain.com"
```

If your SQL dialect supports double quoted strings and you prefer them, just make sure to be consistent and not switch between single and double quotes.

### Use `!=` over `<>`

Simply because `!=` reads like "not equal" which is closer to how we'd say it out loud.

```sql
-- Good
select
  count(*) as paying_users_count
from users
where
  plan_name != 'free'
```

### Commas should be at the end of lines

```sql
-- Good
select
  id,
  email
from users

-- Bad
select
    id
  , email
from users
```

While the commas-first style does have some practical advantages (it's easier to spot missing commas and results in cleaner diffs), I'm just not a huge fan of how it looks so I prefer commas-last.

### Avoid spaces inside of parentheses

```sql
-- Good
select *
from users
where
  id in (1, 2)

-- Bad
select *
from users
where
  id in ( 1, 2 )
```

### Break long lists of `in` values into multiple indented lines

```sql
-- Good
select *
from users
where
  email in (
    'user-1@example.com',
    'user-2@example.com',
    'user-3@example.com',
    'user-4@example.com'
  )
```

### Table names should be a plural snake case of the noun

```sql
-- Good
select *
from users

-- Good
select *
from visit_logs

-- Bad
select *
from user

-- Bad
select *
from visitLog
```

### Column names should be snake_case

```sql
-- Good
select
  id,
  email,
  timestamp_trunc(created_at, month) as signup_month
from users

-- Bad
select
  id,
  email,
  timestamp_trunc(created_at, month) as SignupMonth
from users
```

### Column name conventions

* Boolean fields should be prefixed with `is_`, `has_`, or `does_`. For example, `is_customer`, `has_unsubscribed`, etc.
* Date-only fields should be suffixed with `_date`. For example, `report_date`.
* Date+time fields should be suffixed with `_at`. For example, `created_at`, `posted_at`, etc.

### Column order conventions

Put the primary key first, followed by foreign keys, then by all other columns. If the table has any system columns (`created_at`, `updated_at`, `is_deleted`, etc.), put those last.
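
As a rough sketch, a query that follows both the naming and the ordering conventions might look like the one below. The columns `company_id`, `is_customer`, and `trial_end_date` are hypothetical, added purely for illustration; they are not part of any real schema referenced in this guide:

```sql
-- Illustrative only: company_id, is_customer, and trial_end_date are assumed columns
select
  id,              -- primary key first
  company_id,      -- then foreign keys
  email,           -- then all other columns
  is_customer,     -- boolean: is_/has_/does_ prefix
  trial_end_date,  -- date only: _date suffix
  created_at,      -- system columns last; date+time: _at suffix
  updated_at
from users
```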

```sql
-- Good
select
  id,
  name,
  created_at
from users

-- Bad
select
  created_at,
  name,
  id
from users
```

### Include `inner` for inner joins

Better to be explicit so that the join type is crystal clear:

```sql
-- Good
select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

-- Bad
select
  users.email,
  sum(charges.amount) as total_revenue
from users
join charges on users.id = charges.user_id
```

### For join conditions, put the table that was referenced first immediately after the `on`

Doing it this way makes it easier to determine whether your join is going to cause the results to fan out:

```sql
-- Good
select
  ...
from users
left join charges on users.id = charges.user_id
-- primary_key = foreign_key --> one-to-many --> fanout

select
  ...
from charges
left join users on charges.user_id = users.id
-- foreign_key = primary_key --> many-to-one --> no fanout

-- Bad
select
  ...
from users
left join charges on charges.user_id = users.id
```

### Single join conditions should be on the same line as the join

```sql
-- Good
select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on users.id = charges.user_id
group by email

-- Bad
select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges
on users.id = charges.user_id
group by email
```

When you have multiple join conditions, place each one on its own indented line:

```sql
-- Good
select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on
  users.id = charges.user_id and
  refunded = false
group by email
```

### Avoid aliasing table names most of the time

It can be tempting to abbreviate table names like `users` to `u` and `charges` to `c`, but it winds up making the SQL less readable:

```sql
-- Good
select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

-- Bad
select
  u.email,
  sum(c.amount) as total_revenue
from users u
inner join charges c on u.id = c.user_id
```

Most of the time you'll want to type out the full table name.

There are two exceptions:

If you need to join to a table more than once in the same query and need to distinguish each version of it, aliases are necessary.
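
Here is a minimal sketch of that first exception, a self-join. The `employees` table and its `manager_id` column are hypothetical and only serve to illustrate the point; note that the alias is still a meaningful name rather than a single letter:

```sql
-- Hypothetical schema: employees(id, name, manager_id),
-- where manager_id refers back to employees.id
select
  employees.name as employee_name,
  managers.name as manager_name
from employees
left join employees managers on employees.manager_id = managers.id
-- foreign_key = primary_key --> many-to-one --> no fanout
```

Without the `managers` alias there would be no way to tell the two copies of the table apart.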

Also, if you're working with long or ambiguous table names, it can be useful to alias them (but still use meaningful names):

```sql
-- Good: Meaningful table aliases
select
  companies.com_name,
  beacons.created_at
from stg_mysql_helpscout__helpscout_companies companies
inner join stg_mysql_helpscout__helpscout_beacons_v2 beacons on companies.com_id = beacons.com_id

-- OK: No table aliases
select
  stg_mysql_helpscout__helpscout_companies.com_name,
  stg_mysql_helpscout__helpscout_beacons_v2.created_at
from stg_mysql_helpscout__helpscout_companies
inner join stg_mysql_helpscout__helpscout_beacons_v2 on stg_mysql_helpscout__helpscout_companies.com_id = stg_mysql_helpscout__helpscout_beacons_v2.com_id

-- Bad: Unclear table aliases
select
  c.com_name,
  b.created_at
from stg_mysql_helpscout__helpscout_companies c
inner join stg_mysql_helpscout__helpscout_beacons_v2 b on c.com_id = b.com_id
```

### Include the table when there is a join, but omit it otherwise

When there are no joins involved, there's no ambiguity around which table the columns came from so you can leave the table name out:

```sql
-- Good
select
  id,
  name
from companies

-- Bad
select
  companies.id,
  companies.name
from companies
```

But when there are joins involved, it's better to be explicit so it's clear where the columns originated:

```sql
-- Good
select
  users.email,
  sum(charges.amount) as total_revenue
from users
inner join charges on users.id = charges.user_id

-- Bad
select
  email,
  sum(amount) as total_revenue
from users
inner join charges on users.id = charges.user_id
```

### Always rename aggregates and function-wrapped arguments

```sql
-- Good
select
  count(*) as total_users
from users

-- Bad
select
  count(*)
from users

-- Good
select
  timestamp_millis(property_beacon_interest) as expressed_interest_at
from hubspot.contact
where
  property_beacon_interest is not null

-- Bad
select
  timestamp_millis(property_beacon_interest)
from hubspot.contact
where
  property_beacon_interest is not null
```

### Be explicit in boolean conditions

```sql
-- Good
select *
from customers
where
  is_cancelled = true

-- Good
select *
from customers
where
  is_cancelled = false

-- Bad
select *
from customers
where
  is_cancelled

-- Bad
select *
from customers
where
  not is_cancelled
```

### Use `as` to alias column names

```sql
-- Good
select
  id,
  email,
  timestamp_trunc(created_at, month) as signup_month
from users

-- Bad
select
  id,
  email,
  timestamp_trunc(created_at, month) signup_month
from users
```

### Group using column names or numbers, but not both

I prefer grouping by name, but grouping by numbers is [also fine](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/).

```sql
-- Good
select
  user_id,
  count(*) as total_charges
from charges
group by user_id

-- Good
select
  user_id,
  count(*) as total_charges
from charges
group by 1

-- Bad
select
  timestamp_trunc(created_at, month) as signup_month,
  vertical,
  count(*) as users_count
from users
group by 1, vertical
```

### Take advantage of lateral column aliasing when grouping by name

```sql
-- Good
select
  timestamp_trunc(com_created_at, year) as signup_year,
  count(*) as total_companies
from companies
group by signup_year

-- Bad
select
  timestamp_trunc(com_created_at, year) as signup_year,
  count(*) as total_companies
from companies
group by timestamp_trunc(com_created_at, year)
```

### Grouping columns should go first

```sql
-- Good
select
  timestamp_trunc(com_created_at, year) as signup_year,
  count(*) as total_companies
from companies
group by signup_year

-- Bad
select
  count(*) as total_companies,
  timestamp_trunc(com_created_at, year) as signup_year
from companies
group by signup_year
```

### Aligning case/when statements

Each `when` should be on its own line (nothing on the `case` line) and should be indented one level deeper than the `case` line. The `then` can be on the same line or on its own line below it, just aim to be consistent.

```sql
-- Good
select
  case
    when event_name = 'viewed_homepage' then 'Homepage'
    when event_name = 'viewed_editor' then 'Editor'
    else 'Other'
  end as page_name
from events

-- Good too
select
  case
    when event_name = 'viewed_homepage'
      then 'Homepage'
    when event_name = 'viewed_editor'
      then 'Editor'
    else 'Other'
  end as page_name
from events

-- Bad
select
  case when event_name = 'viewed_homepage' then 'Homepage'
      when event_name = 'viewed_editor' then 'Editor'
      else 'Other'
  end as page_name
from events
```

### Use CTEs, not subqueries

Avoid subqueries; CTEs will make your queries easier to read and reason about.

When using CTEs, pad the query with new lines.

If you use any CTEs, always `select *` from the last CTE at the end. That way you can quickly inspect the output of other CTEs used in the query to debug the results.

Closing CTE parentheses should use the same indentation level as `with` and the CTE names.

```sql
-- Good
with ordered_details as (

  select
    user_id,
    name,
    row_number() over (partition by user_id order by date_updated desc) as details_rank
  from billingdaddy.billing_stored_details

),

first_updates as (

  select
    user_id,
    name
  from ordered_details
  where
    details_rank = 1

)

select * from first_updates

-- Bad
select
  user_id,
  name
from (
  select
    user_id,
    name,
    row_number() over (partition by user_id order by date_updated desc) as details_rank
  from billingdaddy.billing_stored_details
) ranked
where
  details_rank = 1
```

### Use meaningful CTE names

```sql
-- Good
with ordered_details as (

-- Bad
with d1 as (
```

### Window functions

Keep the whole window function on a single line when it fits. If it gets long, breaking it across multiple indented lines is okay too:

```sql
-- Good
select
  user_id,
  name,
  row_number() over (partition by user_id order by date_updated desc) as details_rank
from billingdaddy.billing_stored_details

-- Okay
select
  user_id,
  name,
  row_number() over (
    partition by user_id
    order by date_updated desc
  ) as details_rank
from billingdaddy.billing_stored_details
```

## Credits

This style guide was inspired in part by:

* [Fishtown Analytics' dbt Style Guide](https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#sql-style-guide)
* [KickStarter's SQL Style Guide](https://gist.github.com/fredbenenson/7bb92718e19138c20591)
* [GitLab's SQL Style Guide](https://about.gitlab.com/handbook/business-ops/data-team/sql-style-guide/)

Hat-tip to Peter Butler, Dan Wyman, Simon Ouderkirk, Alex Cano, Adam Stone, Brian Kim, and Claire Carroll for providing feedback on this guide.