└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # Mazur's SQL Style Guide
  3 | 
  4 | Howdy! I'm [Matt Mazur](https://mattmazur.com/) and I'm a data analyst who has worked at several startups to help them use data to grow their businesses. This guide is an attempt to document my preferences for formatting SQL in the hope that it may be of some use to others. If you or your team do not already have a SQL style guide, this may serve as a good starting point which you can adopt and update based on your preferences. 
  5 | 
  6 | Also, I'm a strong believer in having [Strong Opinions, Weakly Held](https://medium.com/@ameet/strong-opinions-weakly-held-a-framework-for-thinking-6530d417e364), so if you disagree with any of this, [drop me a note](https://mattmazur.com/contact/). I'd love to discuss it.
  7 | 
  8 | You may also enjoy my [LookML Style Guide](https://github.com/mattm/lookml-style-guide) or my [blog](https://mattmazur.com/category/analytics/) where I write about analytics and data analysis.
  9 | 
 10 | Also, if you're interested in continuing to level up your technical skills, **check out [Emergent Mind](https://www.emergentmind.com/?utm_source=sqlstyle)**, an AI research assistant I'm working on that helps you discover and learn about cutting-edge computer science research.
 11 | 
 12 | Simplified Chinese version here: [中文版](https://github.com/huangxinping/sql-style-guide/blob/zh-cn/README.md)
 13 | 
 14 | ## Example
 15 | 
 16 | Here's a non-trivial query to give you an idea of what this style guide looks like in practice:
 17 | 
 18 | ```sql
 19 | with hubspot_interest as (
 20 | 
 21 |     select
 22 |         email,
 23 |         timestamp_millis(property_beacon_interest) as expressed_interest_at
 24 |     from hubspot.contact
 25 |     where 
 26 |         property_beacon_interest is not null
 27 | 
 28 | ), 
 29 | 
 30 | support_interest as (
 31 | 
 32 |     select 
 33 |         conversation.email,
 34 |         conversation.created_at as expressed_interest_at
 35 |     from helpscout.conversation
 36 |     inner join helpscout.conversation_tag on conversation.id = conversation_tag.conversation_id
 37 |     where 
 38 |         conversation_tag.tag = 'beacon-interest'
 39 | 
 40 | ), 
 41 | 
 42 | combined_interest as (
 43 | 
 44 |     select * from hubspot_interest
 45 |     union all
 46 |     select * from support_interest
 47 | 
 48 | ),
 49 | 
 50 | first_interest as (
 51 | 
 52 |     select 
 53 |         email,
 54 |         min(expressed_interest_at) as expressed_interest_at
 55 |     from combined_interest
 56 |     group by email
 57 | 
 58 | )
 59 | 
 60 | select * from first_interest
 61 | ```
 62 | ## Guidelines
 63 | 
 64 | ### Use lowercase SQL
 65 | 
 66 | It's just as readable as uppercase SQL and you won't have to constantly be holding down a shift key.
 67 | 
 68 | ```sql
 69 | -- Good
 70 | select * from users
 71 | 
 72 | -- Bad
 73 | SELECT * FROM users
 74 | 
 75 | -- Bad
 76 | Select * From users
 77 | ```
 78 | 
 79 | ### Put each selected column on its own line
 80 | 
 81 | When selecting columns, always put each column name on its own line and never on the same line as `select`. For multiple columns, it's easier to read when each column is on its own line. And for single columns, it's easier to add additional columns without any reformatting (which you would have to do if the single column name was on the same line as the `select`).
 82 | 
 83 | ```sql
 84 | -- Good
 85 | select 
 86 |     id
 87 | from users 
 88 | 
 89 | -- Good
 90 | select 
 91 |     id,
 92 |     email
 93 | from users 
 94 | 
 95 | -- Bad
 96 | select id
 97 | from users 
 98 | 
 99 | -- Bad
100 | select id, email
101 | from users 
102 | ```
103 | 
104 | ### select *
105 | 
106 | When selecting `*` it's fine to include the `*` next to the `select` and also fine to include the `from` on the same line, assuming no additional complexity like `where` conditions:
107 | 
108 | ```sql
109 | -- Good
110 | select * from users 
111 | 
112 | -- Good too
113 | select *
114 | from users
115 | 
116 | -- Bad
117 | select * from users where email = 'name@example.com'
118 | ```
119 | 
120 | ### Indenting where conditions
121 | 
122 | Similarly, conditions should always be spread across multiple lines to maximize readability and make them easier to add to. Operators should be placed at the end of each line:
123 | 
124 | ```sql
125 | -- Good
126 | select *
127 | from users
128 | where 
129 |     email = 'example@domain.com'
130 | 
131 | -- Good
132 | select *
133 | from users
134 | where 
135 |     email like '%@domain.com' and 
136 |     created_at >= '2021-10-08'
137 | 
138 | -- Bad
139 | select *
140 | from users
141 | where email = 'example@domain.com'
142 | 
143 | -- Bad
144 | select *
145 | from users
146 | where 
147 |     email like '%@domain.com' and created_at >= '2021-10-08'
148 | 
149 | -- Bad
150 | select *
151 | from users
152 | where 
153 |     email like '%@domain.com' 
154 |     and created_at >= '2021-10-08'
155 | ```
156 | 
157 | ### Left align SQL keywords
158 | 
159 | Some IDEs have the ability to automatically format SQL so that the spaces after the SQL keywords are vertically aligned. This is cumbersome to do by hand (and in my opinion harder to read anyway) so I recommend just left aligning all of the keywords:
160 | 
161 | ```sql
162 | -- Good
163 | select 
164 |     id,
165 |     email
166 | from users
167 | where 
168 |     email like '%@gmail.com'
169 | 
170 | -- Bad
171 | select id, email
172 |   from users
173 |  where email like '%@gmail.com'
174 | ```
175 | 
176 | ### Use single quotes
177 | 
178 | Some SQL dialects like BigQuery support using double quotes, but for most dialects double quotes will wind up referring to column names. For that reason, single quotes are preferable:
179 | 
180 | ```sql
181 | -- Good
182 | select *
183 | from users
184 | where 
185 |     email = 'example@domain.com'
186 | 
187 | -- Bad
188 | select *
189 | from users
190 | where 
191 |     email = "example@domain.com"
192 | ```
193 | 
194 | If your SQL dialect supports double quoted strings and you prefer them, just make sure to be consistent and not switch between single and double quotes.
195 | 
196 | ### Use `!=` over `<>`
197 | 
198 | Simply because `!=` reads like "not equal" which is closer to how we'd say it out loud.
199 | 
200 | ```sql
201 | -- Good
202 | select 
203 |     count(*) as paying_users_count
204 | from users
205 | where 
206 |     plan_name != 'free'
207 | ```
208 | 
209 | ### Commas should be at the end of lines
210 | 
211 | ```sql
212 | -- Good
213 | select
214 |     id,
215 |     email
216 | from users
217 | 
218 | -- Bad
219 | select
220 |     id
221 |     , email
222 | from users
223 | ```
224 | 
225 | While the commas-first style does have some practical advantages (it's easier to spot missing commas and results in cleaner diffs), I'm just not a huge fan of how it looks, so I prefer commas-last.
226 | 
227 | ### Avoid spaces inside of parentheses
228 | 
229 | ```sql
230 | -- Good
231 | select *
232 | from users
233 | where 
234 |     id in (1, 2)
235 | 
236 | -- Bad
237 | select *
238 | from users
239 | where 
240 |     id in ( 1, 2 )
241 | ```
242 | 
243 | ### Break long lists of `in` values into multiple indented lines
244 | 
245 | ```sql
246 | -- Good
247 | select *
248 | from users
249 | where 
250 |     email in (
251 |         'user-1@example.com',
252 |         'user-2@example.com',
253 |         'user-3@example.com',
254 |         'user-4@example.com'
255 |     )
256 | ```
257 | 
258 | ### Table names should be a plural snake case of the noun
259 | 
260 | ```sql
261 | -- Good
262 | select * 
263 | from users
264 | 
265 | -- Good
266 | select * 
267 | from visit_logs
268 | 
269 | -- Bad
270 | select * 
271 | from user
272 | 
273 | -- Bad
274 | select * 
275 | from visitLog
276 | ```
277 | 
278 | ### Column names should be snake_case
279 | 
280 | ```sql
281 | -- Good
282 | select
283 |     id,
284 |     email,
285 |     timestamp_trunc(created_at, month) as signup_month
286 | from users
287 | 
288 | -- Bad
289 | select
290 |     id,
291 |     email,
292 |     timestamp_trunc(created_at, month) as SignupMonth
293 | from users
294 | ```
295 | 
296 | ### Column name conventions
297 | 
298 | * Boolean fields should be prefixed with `is_`, `has_`, or `does_`. For example, `is_customer`, `has_unsubscribed`, etc.
299 | * Date-only fields should be suffixed with `_date`. For example, `report_date`.
300 | * Date+time fields should be suffixed with `_at`. For example, `created_at`, `posted_at`, etc.
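
As a rough illustration of these naming conventions, a query against a `users` table might look like the sketch below. The `is_customer`, `has_unsubscribed`, and `trial_end_date` columns are hypothetical and exist only for this example:

```sql
select
    id,
    is_customer,       -- boolean: prefixed with is_
    has_unsubscribed,  -- boolean: prefixed with has_
    trial_end_date,    -- date only: suffixed with _date
    created_at         -- date + time: suffixed with _at
from users
```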
301 | 
302 | ### Column order conventions
303 | 
304 | Put the primary key first, followed by foreign keys, then by all other columns. If the table has any system columns (`created_at`, `updated_at`, `is_deleted`, etc.), put those last.
305 | 
306 | ```sql
307 | -- Good
308 | select
309 |     id,
310 |     name,
311 |     created_at
312 | from users
313 | 
314 | -- Bad
315 | select
316 |     created_at,
317 |     name,
318 |     id
319 | from users
320 | ```
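
The example above has no foreign keys, so here is a minimal sketch of the full ordering using the `charges` table that appears elsewhere in this guide. It assumes `charges` has `id`, `user_id`, `amount`, `created_at`, and `updated_at` columns; that exact schema is hypothetical:

```sql
select
    id,          -- primary key
    user_id,     -- foreign key to users
    amount,      -- regular column
    created_at,  -- system column
    updated_at   -- system column
from charges
```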
321 | 
322 | ### Include `inner` for inner joins
323 | 
324 | Better to be explicit so that the join type is crystal clear:
325 | 
326 | ```sql
327 | -- Good
328 | select
329 |     users.email,
330 |     sum(charges.amount) as total_revenue
331 | from users
332 | inner join charges on users.id = charges.user_id
333 | 
334 | -- Bad
335 | select
336 |     users.email,
337 |     sum(charges.amount) as total_revenue
338 | from users
339 | join charges on users.id = charges.user_id
340 | ```
341 | 
342 | ### For join conditions, put the table that was referenced first immediately after the `on`
343 | 
344 | Doing it this way makes it easier to determine if your join is going to cause the results to fan out:
345 | 
346 | ```sql
347 | -- Good
348 | select
349 |     ...
350 | from users
351 | left join charges on users.id = charges.user_id
352 | -- primary_key = foreign_key --> one-to-many --> fanout
353 |   
354 | select
355 |     ...
356 | from charges
357 | left join users on charges.user_id = users.id
358 | -- foreign_key = primary_key --> many-to-one --> no fanout
359 | 
360 | -- Bad
361 | select
362 |     ...
363 | from users
364 | left join charges on charges.user_id = users.id
365 | ```
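
One way to see the fanout in the first query is to compare the number of joined rows with the number of distinct users. This is only a sketch, assuming `users.id` is the primary key and a user can have many rows in `charges`:

```sql
-- A user with three charges appears in three joined rows,
-- so the first count can exceed the second
select
    count(*) as joined_rows_count,
    count(distinct users.id) as users_count
from users
left join charges on users.id = charges.user_id
```

If `joined_rows_count` comes back larger than `users_count`, the join fanned out.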
366 | 
367 | ### Single join conditions should be on the same line as the join
368 | 
369 | ```sql
370 | -- Good
371 | select
372 |     users.email,
373 |     sum(charges.amount) as total_revenue
374 | from users
375 | inner join charges on users.id = charges.user_id
376 | group by email
377 | 
378 | -- Bad
379 | select
380 |     users.email,
381 |     sum(charges.amount) as total_revenue
382 | from users
383 | inner join charges
384 | on users.id = charges.user_id
385 | group by email
386 | ```
387 | 
388 | When you have multiple join conditions, place each one on its own indented line:
389 | 
390 | ```sql
391 | -- Good
392 | select
393 |     users.email,
394 |     sum(charges.amount) as total_revenue
395 | from users
396 | inner join charges on 
397 |     users.id = charges.user_id and
398 |     refunded = false
399 | group by email
400 | ```
401 | 
402 | ### Avoid aliasing table names most of the time
403 | 
404 | It can be tempting to abbreviate table names like `users` to `u` and `charges` to `c`, but it winds up making the SQL less readable:
405 | 
406 | ```sql
407 | -- Good
408 | select
409 |     users.email,
410 |     sum(charges.amount) as total_revenue
411 | from users
412 | inner join charges on users.id = charges.user_id
413 | 
414 | -- Bad
415 | select
416 |     u.email,
417 |     sum(c.amount) as total_revenue
418 | from users u
419 | inner join charges c on u.id = c.user_id
420 | ```
421 | 
422 | Most of the time you'll want to type out the full table name.
423 | 
424 | There are two exceptions:
425 | 
426 | If you need to join to a table more than once in the same query and need to distinguish each version of it, aliases are necessary.
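
For instance, a self-join needs aliases to tell the two copies of the table apart. The sketch below assumes a hypothetical `referrer_id` column on `users` that points at the user who referred them:

```sql
select
    referred_users.email as referred_email,
    referrers.email as referrer_email
from users referred_users
inner join users referrers on referred_users.referrer_id = referrers.id
```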
427 | 
428 | Also, if you're working with long or ambiguous table names, it can be useful to alias them (but still use meaningful names):
429 | 
430 | ```sql
431 | -- Good: Meaningful table aliases
432 | select
433 |     companies.com_name,
434 |     beacons.created_at
435 | from stg_mysql_helpscout__helpscout_companies companies
436 | inner join stg_mysql_helpscout__helpscout_beacons_v2 beacons on companies.com_id = beacons.com_id
437 | 
438 | -- OK: No table aliases
439 | select
440 |     stg_mysql_helpscout__helpscout_companies.com_name,
441 |     stg_mysql_helpscout__helpscout_beacons_v2.created_at
442 | from stg_mysql_helpscout__helpscout_companies
443 | inner join stg_mysql_helpscout__helpscout_beacons_v2 on stg_mysql_helpscout__helpscout_companies.com_id = stg_mysql_helpscout__helpscout_beacons_v2.com_id
444 | 
445 | -- Bad: Unclear table aliases
446 | select
447 |     c.com_name,
448 |     b.created_at
449 | from stg_mysql_helpscout__helpscout_companies c
450 | inner join stg_mysql_helpscout__helpscout_beacons_v2 b on c.com_id = b.com_id
451 | ```
452 | 
453 | ### Include the table when there is a join, but omit it otherwise
454 | 
455 | When there are no joins involved, there's no ambiguity around which table the columns came from so you can leave the table name out:
456 | 
457 | ```sql
458 | -- Good
459 | select
460 |     id,
461 |     name
462 | from companies
463 | 
464 | -- Bad
465 | select
466 |     companies.id,
467 |     companies.name
468 | from companies
469 | ```
470 | 
471 | But when there are joins involved, it's better to be explicit so it's clear where the columns originated:
472 | 
473 | ```sql
474 | -- Good
475 | select
476 |     users.email,
477 |     sum(charges.amount) as total_revenue
478 | from users
479 | inner join charges on users.id = charges.user_id
480 | 
481 | -- Bad
482 | select
483 |     email,
484 |     sum(amount) as total_revenue
485 | from users
486 | inner join charges on users.id = charges.user_id
487 | 
488 | ```
489 | 
490 | ### Always rename aggregates and function-wrapped arguments
491 | 
492 | ```sql
493 | -- Good
494 | select 
495 |     count(*) as total_users
496 | from users
497 | 
498 | -- Bad
499 | select 
500 |     count(*)
501 | from users
502 | 
503 | -- Good
504 | select 
505 |     timestamp_millis(property_beacon_interest) as expressed_interest_at
506 | from hubspot.contact
507 | where 
508 |     property_beacon_interest is not null
509 | 
510 | -- Bad
511 | select
512 |     timestamp_millis(property_beacon_interest)
513 | from hubspot.contact
514 | where
515 |     property_beacon_interest is not null
516 | ```
517 | 
518 | ### Be explicit in boolean conditions
519 | 
520 | ```sql
521 | -- Good
522 | select * 
523 | from customers 
524 | where 
525 |     is_cancelled = true
526 | 
527 | -- Good
528 | select * 
529 | from customers 
530 | where 
531 |     is_cancelled = false
532 | 
533 | -- Bad
534 | select * 
535 | from customers 
536 | where 
537 |     is_cancelled
538 | 
539 | -- Bad
540 | select * 
541 | from customers 
542 | where 
543 |     not is_cancelled
544 | ```
545 | 
546 | ### Use `as` to alias column names
547 | 
548 | ```sql
549 | -- Good
550 | select
551 |     id,
552 |     email,
553 |     timestamp_trunc(created_at, month) as signup_month
554 | from users
555 | 
556 | -- Bad
557 | select
558 |     id,
559 |     email,
560 |     timestamp_trunc(created_at, month) signup_month
561 | from users
562 | ```
563 | 
564 | ### Group using column names or numbers, but not both
565 | 
566 | I prefer grouping by name, but grouping by numbers is [also fine](https://blog.getdbt.com/write-better-sql-a-defense-of-group-by-1/).
567 | 
568 | ```sql
569 | -- Good
570 | select 
571 |     user_id, 
572 |     count(*) as total_charges
573 | from charges
574 | group by user_id
575 | 
576 | -- Good
577 | select 
578 |     user_id, 
579 |     count(*) as total_charges
580 | from charges
581 | group by 1
582 | 
583 | -- Bad
584 | select
585 |     timestamp_trunc(created_at, month) as signup_month,
586 |     vertical,
587 |     count(*) as users_count
588 | from users
589 | group by 1, vertical
590 | ```
591 | 
592 | ### Take advantage of lateral column aliasing when grouping by name
593 | 
594 | ```sql
595 | -- Good
596 | select
597 |     timestamp_trunc(com_created_at, year) as signup_year,
598 |     count(*) as total_companies
599 | from companies
600 | group by signup_year
601 | 
602 | -- Bad
603 | select
604 |     timestamp_trunc(com_created_at, year) as signup_year,
605 |     count(*) as total_companies
606 | from companies
607 | group by timestamp_trunc(com_created_at, year)
608 | ```
609 | 
610 | ### Grouping columns should go first
611 | 
612 | ```sql
613 | -- Good
614 | select
615 |     timestamp_trunc(com_created_at, year) as signup_year,
616 |     count(*) as total_companies
617 | from companies
618 | group by signup_year
619 | 
620 | -- Bad
621 | select
622 |     count(*) as total_companies,
623 |     timestamp_trunc(com_created_at, year) as signup_year
624 | from companies
625 | group by signup_year
626 | ```
627 | 
628 | ### Aligning case/when statements
629 | 
630 | Each `when` should be on its own line (nothing on the `case` line) and should be indented one level deeper than the `case` line. The `then` can be on the same line or on its own line below it; just aim to be consistent.
631 | 
632 | ```sql
633 | -- Good
634 | select
635 |     case
636 |         when event_name = 'viewed_homepage' then 'Homepage'
637 |         when event_name = 'viewed_editor' then 'Editor'
638 |         else 'Other'
639 |     end as page_name
640 | from events
641 | 
642 | -- Good too
643 | select
644 |     case
645 |         when event_name = 'viewed_homepage'
646 |             then 'Homepage'
647 |         when event_name = 'viewed_editor'
648 |             then 'Editor'
649 |         else 'Other'            
650 |     end as page_name
651 | from events
652 | 
653 | -- Bad 
654 | select
655 |     case when event_name = 'viewed_homepage' then 'Homepage'
656 |         when event_name = 'viewed_editor' then 'Editor'
657 |         else 'Other'        
658 |     end as page_name
659 | from events
660 | ```
661 | 
662 | ### Use CTEs, not subqueries
663 | 
664 | Avoid subqueries; CTEs will make your queries easier to read and reason about.
665 | 
666 | When using CTEs, pad the query with new lines. 
667 | 
668 | If you use any CTEs, always `select *` from the last CTE at the end. That way you can quickly inspect the output of other CTEs used in the query to debug the results.
669 | 
670 | Closing CTE parentheses should use the same indentation level as `with` and the CTE names.
671 | 
672 | ```sql
673 | -- Good
674 | with ordered_details as (
675 | 
676 |     select
677 |         user_id,
678 |         name,
679 |         row_number() over (partition by user_id order by date_updated desc) as details_rank
680 |     from billingdaddy.billing_stored_details
681 | 
682 | ),
683 | 
684 | first_updates as (
685 | 
686 |     select 
687 |         user_id, 
688 |         name
689 |     from ordered_details
690 |     where 
691 |         details_rank = 1
692 | 
693 | )
694 | 
695 | select * from first_updates
696 | 
697 | -- Bad
698 | select 
699 |     user_id, 
700 |     name
701 | from (
702 |     select
703 |         user_id,
704 |         name,
705 |         row_number() over (partition by user_id order by date_updated desc) as details_rank
706 |     from billingdaddy.billing_stored_details
707 | ) ranked
708 | where 
709 |     details_rank = 1
710 | ```
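
To illustrate the debugging benefit, the sketch below takes the good query above and changes only the final line so that it returns the intermediate `ordered_details` CTE instead. Everything else stays as it was:

```sql
with ordered_details as (

    select
        user_id,
        name,
        row_number() over (partition by user_id order by date_updated desc) as details_rank
    from billingdaddy.billing_stored_details

),

first_updates as (

    select
        user_id,
        name
    from ordered_details
    where
        details_rank = 1

)

-- Temporarily swapped from first_updates to inspect the ranked rows
select * from ordered_details
```

Once the intermediate output looks right, switch the last line back to `select * from first_updates`.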
711 | 
712 | ### Use meaningful CTE names
713 | 
714 | ```sql
715 | -- Good
716 | with ordered_details as (
717 | 
718 | -- Bad
719 | with d1 as (
720 | ```
721 | 
722 | ### Window functions
723 | 
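724 | Keep the whole window function on a single line when it fits, or break the `partition by` and `order by` clauses onto their own indented lines when it gets long: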
725 | 
726 | ```sql
727 | -- Good
728 | select
729 |     user_id,
730 |     name,
731 |     row_number() over (partition by user_id order by date_updated desc) as details_rank
732 | from billingdaddy.billing_stored_details
733 | 
734 | -- Okay
735 | select
736 |     user_id,
737 |     name,
738 |     row_number() over (
739 |         partition by user_id
740 |         order by date_updated desc
741 |     ) as details_rank
742 | from billingdaddy.billing_stored_details
743 | ```
744 | 
745 | ## Credits
746 | 
747 | This style guide was inspired in part by:
748 | 
749 | * [Fishtown Analytics' dbt Style Guide](https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#sql-style-guide)
750 | * [KickStarter's SQL Style Guide](https://gist.github.com/fredbenenson/7bb92718e19138c20591)
751 | * [GitLab's SQL Style Guide](https://about.gitlab.com/handbook/business-ops/data-team/sql-style-guide/)
752 | 
753 | Hat-tip to Peter Butler, Dan Wyman, Simon Ouderkirk, Alex Cano, Adam Stone, Brian Kim, and Claire Carroll for providing feedback on this guide.
754 | 


--------------------------------------------------------------------------------